SQL essentials: Part 1

SQL is the analyst’s language of choice for handling structured data. Here's an introduction to some basic but essential SQL functions.
January 16, 2020

Relational databases and the languages used to query them date back to the 1970s. To this day, structured query language (SQL) remains the standard means for analysts to communicate with relational databases and their columnar, cloud-based counterparts: data warehouses.

This guide assumes you have a basic knowledge of SQL but will introduce several concepts, functions, and key pieces of syntax that you might not be aware of. In part 1 of this series, we will discuss the following essential topics:

  1. Beyond SELECT: The order of operations
  2. Aliasing tables
  3. Common table expressions
  4. Making Better JOINS

Beyond SELECT: The order of operations

The SELECT statement makes up the majority of the SQL you will write. It is therefore essential to understand fully, and is a common intro question during interviews to boot. In the context of the modern data pipeline and ELT, even CREATE and INSERT statements are often only used as wrappers around a SELECT statement.

Beyond that, it is important to understand roughly in what order these parts will be executed by the DB itself. This small step beyond the basics will serve you well as you go deeper into writing SQL.

Pay attention to the numbers in parentheses in the following query:

SELECT (_) DISTINCT (7) expression_list (3) window_functions (6) FROM (1) JOIN (1b) WHERE (2) GROUP BY (4) HAVING (5) ORDER BY (8) LIMIT (9)

Understanding that HAVING is necessarily later than WHERE solidifies the difference between the two. The WHERE clause tells the system to only extract individual records that meet certain criteria. The HAVE clause filters as well, but for records that have been aggregated by the GROUP BY clause. Recall that the GROUP BY clause groups rows with common properties together and then enables aggregate operations such as SUM and COUNT, which perform calculations over grouped rows, producing one output row.

Beyond the basics, understanding the fundamentals of ordering and execution will be useful when we encounter more difficult problems. It is particularly useful when performance considerations are relevant, and looking at data warehouse design, although these topics are both outside of the scope of this article.

A neat example of where knowledge of ordering matters is the case of WINDOW functions (which are covered in depth later in this article). Window functions are written in the SELECT list, but much of their calculation occurs after aggregation and aggregate filtering (HAVING). Essentially, Window functions are calculated after all other parts of the select expression, with the exception of ORDER BY and LIMIT. Snowflake has a lovely QUALIFY syntax, which allows filtering on WINDOW functions. QUALIFY is to ROW_NUMBER (a WINDOW function) as HAVING is to SUM (an aggregate function). Below is theoretical SQL whose composite parts are numbered in the order of execution.

SELECT DISTINCT -- (6) -- (3) user_id, order_id, revenue, FIRST_VALUE(order_id) OVER ( PARTITION BY user_id ORDER BY created_at ) AS first_order_id -- (4b) FROM orders -- (1) WHERE user_age > 20 -- (2) QUALIFY FIRST_VALUE(revenue) OVER ( PARTITION BY user_id ORDER BY created_at ) > 0 -- (4a) Only return results where a users first order was not free ORDER BY user_id ASC -- (5) LIMIT 100 -- (6)

In dialects other Snowflake, you would be forced to produce the following. One can see that the pseudo execution order doesn’t change much as a result of the added syntax (namely the subquery). Although cleaner to read roughly the same work must be done under the hood. :

With prep AS ( SELECT DISTINCT -- (6) -- (3) user_id, Order_id, Revenue, -- (4) FIRST_VALUE(order_id) OVER ( PARTITION BY user_id ORDER BY created_at ) AS first_order_id, FIRST_VALUE(revenue) OVER ( PARTITION BY user_id ORDER BY created_at ) AS first_order_value FROM orders -- (1) WHERE user_age > 20 -- (2) ) SELECT * -- (7) FROM prep -- (5) WHERE first_order_value > 0 -- (6) previously 4b ORDER BY user_id ASC -- (8) LIMIT 100 -- (9)

For a more in-depth treatment of window functions, see Part 2 of this series.

Aliasing tables

Aliasing is about assigning new names to tables, columns and expressions. It is often underused - or even omitted entirely - in the case of table aliases. Although analysts tend to understand early on the importance of aliasing columns or expressions in the SELECT list, the benefits of table aliases for readability are easily overlooked. It’s surprisingly common for even seasoned vets to use full table names rather than aliases, and to omit aliases altogether in places where the parser can figure it out.

Although the only problem with the former is its long-windedness, the latter is difficult to read and can therefore lead to errors. Take the following example:

  • Users
  • Id
  • Name
  • Orders
  • Order_id
  • User_id
  • Status
  • Revenue

SELECT user_id, Name, Status, SUM(Revenue) AS total_revenue FROM orders LEFT JOIN users ON id = user_id GROUP BY 1,2,3

Although this query will run in its current condition, it faces two problems:

1) New columns are being joined to source tables. For example if ‘status’ is added to ‘users’, (for example, to show if they have completed signup or not) there will now be an error introduced, as ‘status’ is ambiguous between ‘users’ and ‘orders’. Although in a single query this isn’t hard to track down, real-world situations are less forgiving. Across a number of larger, scheduled queries, filled with joins and complexity, this type of situation can be a nightmare.

2) The above query contains both “name” and “user_id”. If, later on, “name” is perhaps no longer needed (perhaps it was only included for debugging purposes in the first place) it might be removed from the select list. This leaves the “LEFT JOIN users” statement superfluous. The above occurs quite often, creating ‘orphan joins’ as a result. In the above, we are dealing with only 4 columns and 2 tables, and hence the unneeded join is easy to track down. However, in a more real world example (containing, perhaps, hundreds of columns) each un-aliased column increases the risk of an orphan join. Since those professionals in the data-team (unlike the parser) don’t know all columns by heart, it is often safer to leave joins in if one cannot be sure they are unneeded. However, each such table necessarily joined adds to the length of code reduces readability, and increases the chance of collision as per point one.

The following alternative solves both of the above problems and makes the code easier to understand at a glance:

SELECT O.user_id, -- if the “name” column is omitted we can see “LEFT JOIN users” is not needed u.name, o.status, SUM(o.revenue) AS total_revenue FROM orders AS o LEFT JOIN users AS u ON u.id = o.user_id GROUP BY 1,2,3

Golden Rule: If there is more than one table, alias every column reference

Short table aliases are typically very readable, but for extra usability, you should agree on common aliases for common tables in your team.

Common table expressions

Common Table Expressions (CTE) are another SQL syntax essential for all analysts. Subqueries can create a sprawling, nested mess that is almost impossible to read. CTE’s offer a fresh alternative to this. They allow you to write queries that can be read linearly from top to bottom, allowing any member of the team to quickly understand your route (and therefore your thought process) as you move from tables to outputs. CTEs can also provide recursive query patterns and cleaner code reuse, but by far the most useful aspect is simple readability. As a rule of thumb: always use a CTE over a subquery, and if you need to use a subquery, make sure it reads cleanly on one line.

WITH load_range AS ( SELECT MIN(event_at) AS start_at, MAX(event_at) AS end_at FROM raw_data.events_load ) SELECT *, (SELECT end_at FROM last_load) AS _elt_max_updated_at FROM raw_data.production_orders AS o LEFT JOIN raw_data.events_load AS e ON o.order_id = e.order_id WHERE o.created_at BETWEEN (SELECT start_at FROM load_range) AND (SELECT end_at FROM load_range)

Making better JOINs

JOINs are a quintessential SQL tool. However, there are two often overlooked JOIN properties that go beyond the basic INNER, LEFT AND FULL OUTER JOINS. They are the NATURAL JOIN and the USING keyword. Most of us get our introduction to JOINS using ON and often spend time working with production DBs that boast an “id” column in every table. For these reasons both USING and NATURAL fall by the wayside until we get deep into our clean Kimball modelled world of facts and dims (dimensions). Both concepts depend on the idea that the joining tables share columns of the identical name.

For example given the tables

  • dims.users
  • User_id
  • User_name
  • facts.orders
  • Date,
  • User_id,
  • Order_id,
  • Revenue

The following are all equivalent

FROM facts.orders AS o INNER JOIN dims.users AS u ON u.user_id = o.user_id

FROM facts.orders AS o INNER JOIN dims.users AS u USING(user_id)

FROM facts.orders AS o NATURAL INNER JOIN dims.users AS u

From top to bottom, each successive method surrenders more of the SQL writer’s control to the parsing engine. Although I prefer the explicit nature of the ON join (especially in SQL transformation code) both USING and NATURAL are worth knowing. A nice use case is when two Fact tables are to be rolled up to a shared granularity before being joined.

WITH session_data AS ( SELECT DATE_TRUNC(‘week’,date) AS week, Age_bucket, SUM(sessions) AS sessions, SUM(revenue) AS revenue FROM facts.sessions ), Cost_data AS ( SELECT Week, SUM(cost) AS cost FROM facts.marketing_spend ) SELECT s.week, s.age_bucket, s.sessions, s.revenue, C.cost FROM session_data NATURAL LEFT JOIN cost_data

Conclusion

SELECT statements, aliases, common table expressions, and JOINs are all beginner concepts, yet feature many nuances that aren’t necessarily evident to casual users of SQL. Take these tips and tricks to heart and your SQL game will improve at a very fundamental level.

Start for free

Join the thousands of companies using Fivetran to centralize and transform their data.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Data insights
Data insights

SQL essentials: Part 1

SQL essentials: Part 1

January 16, 2020
January 16, 2020
SQL essentials: Part 1
SQL is the analyst’s language of choice for handling structured data. Here's an introduction to some basic but essential SQL functions.

Relational databases and the languages used to query them date back to the 1970s. To this day, structured query language (SQL) remains the standard means for analysts to communicate with relational databases and their columnar, cloud-based counterparts: data warehouses.

This guide assumes you have a basic knowledge of SQL but will introduce several concepts, functions, and key pieces of syntax that you might not be aware of. In part 1 of this series, we will discuss the following essential topics:

  1. Beyond SELECT: The order of operations
  2. Aliasing tables
  3. Common table expressions
  4. Making Better JOINS

Beyond SELECT: The order of operations

The SELECT statement makes up the majority of the SQL you will write. It is therefore essential to understand fully, and is a common intro question during interviews to boot. In the context of the modern data pipeline and ELT, even CREATE and INSERT statements are often only used as wrappers around a SELECT statement.

Beyond that, it is important to understand roughly in what order these parts will be executed by the DB itself. This small step beyond the basics will serve you well as you go deeper into writing SQL.

Pay attention to the numbers in parentheses in the following query:

SELECT (_) DISTINCT (7) expression_list (3) window_functions (6) FROM (1) JOIN (1b) WHERE (2) GROUP BY (4) HAVING (5) ORDER BY (8) LIMIT (9)

Understanding that HAVING is necessarily later than WHERE solidifies the difference between the two. The WHERE clause tells the system to only extract individual records that meet certain criteria. The HAVE clause filters as well, but for records that have been aggregated by the GROUP BY clause. Recall that the GROUP BY clause groups rows with common properties together and then enables aggregate operations such as SUM and COUNT, which perform calculations over grouped rows, producing one output row.

Beyond the basics, understanding the fundamentals of ordering and execution will be useful when we encounter more difficult problems. It is particularly useful when performance considerations are relevant, and looking at data warehouse design, although these topics are both outside of the scope of this article.

A neat example of where knowledge of ordering matters is the case of WINDOW functions (which are covered in depth later in this article). Window functions are written in the SELECT list, but much of their calculation occurs after aggregation and aggregate filtering (HAVING). Essentially, Window functions are calculated after all other parts of the select expression, with the exception of ORDER BY and LIMIT. Snowflake has a lovely QUALIFY syntax, which allows filtering on WINDOW functions. QUALIFY is to ROW_NUMBER (a WINDOW function) as HAVING is to SUM (an aggregate function). Below is theoretical SQL whose composite parts are numbered in the order of execution.

SELECT DISTINCT -- (6) -- (3) user_id, order_id, revenue, FIRST_VALUE(order_id) OVER ( PARTITION BY user_id ORDER BY created_at ) AS first_order_id -- (4b) FROM orders -- (1) WHERE user_age > 20 -- (2) QUALIFY FIRST_VALUE(revenue) OVER ( PARTITION BY user_id ORDER BY created_at ) > 0 -- (4a) Only return results where a users first order was not free ORDER BY user_id ASC -- (5) LIMIT 100 -- (6)

In dialects other Snowflake, you would be forced to produce the following. One can see that the pseudo execution order doesn’t change much as a result of the added syntax (namely the subquery). Although cleaner to read roughly the same work must be done under the hood. :

With prep AS ( SELECT DISTINCT -- (6) -- (3) user_id, Order_id, Revenue, -- (4) FIRST_VALUE(order_id) OVER ( PARTITION BY user_id ORDER BY created_at ) AS first_order_id, FIRST_VALUE(revenue) OVER ( PARTITION BY user_id ORDER BY created_at ) AS first_order_value FROM orders -- (1) WHERE user_age > 20 -- (2) ) SELECT * -- (7) FROM prep -- (5) WHERE first_order_value > 0 -- (6) previously 4b ORDER BY user_id ASC -- (8) LIMIT 100 -- (9)

For a more in-depth treatment of window functions, see Part 2 of this series.

Aliasing tables

Aliasing is about assigning new names to tables, columns and expressions. It is often underused - or even omitted entirely - in the case of table aliases. Although analysts tend to understand early on the importance of aliasing columns or expressions in the SELECT list, the benefits of table aliases for readability are easily overlooked. It’s surprisingly common for even seasoned vets to use full table names rather than aliases, and to omit aliases altogether in places where the parser can figure it out.

Although the only problem with the former is its long-windedness, the latter is difficult to read and can therefore lead to errors. Take the following example:

  • Users
  • Id
  • Name
  • Orders
  • Order_id
  • User_id
  • Status
  • Revenue

SELECT user_id, Name, Status, SUM(Revenue) AS total_revenue FROM orders LEFT JOIN users ON id = user_id GROUP BY 1,2,3

Although this query will run in its current condition, it faces two problems:

1) New columns are being joined to source tables. For example if ‘status’ is added to ‘users’, (for example, to show if they have completed signup or not) there will now be an error introduced, as ‘status’ is ambiguous between ‘users’ and ‘orders’. Although in a single query this isn’t hard to track down, real-world situations are less forgiving. Across a number of larger, scheduled queries, filled with joins and complexity, this type of situation can be a nightmare.

2) The above query contains both “name” and “user_id”. If, later on, “name” is perhaps no longer needed (perhaps it was only included for debugging purposes in the first place) it might be removed from the select list. This leaves the “LEFT JOIN users” statement superfluous. The above occurs quite often, creating ‘orphan joins’ as a result. In the above, we are dealing with only 4 columns and 2 tables, and hence the unneeded join is easy to track down. However, in a more real world example (containing, perhaps, hundreds of columns) each un-aliased column increases the risk of an orphan join. Since those professionals in the data-team (unlike the parser) don’t know all columns by heart, it is often safer to leave joins in if one cannot be sure they are unneeded. However, each such table necessarily joined adds to the length of code reduces readability, and increases the chance of collision as per point one.

The following alternative solves both of the above problems and makes the code easier to understand at a glance:

SELECT O.user_id, -- if the “name” column is omitted we can see “LEFT JOIN users” is not needed u.name, o.status, SUM(o.revenue) AS total_revenue FROM orders AS o LEFT JOIN users AS u ON u.id = o.user_id GROUP BY 1,2,3

Golden Rule: If there is more than one table, alias every column reference

Short table aliases are typically very readable, but for extra usability, you should agree on common aliases for common tables in your team.

Common table expressions

Common Table Expressions (CTE) are another SQL syntax essential for all analysts. Subqueries can create a sprawling, nested mess that is almost impossible to read. CTE’s offer a fresh alternative to this. They allow you to write queries that can be read linearly from top to bottom, allowing any member of the team to quickly understand your route (and therefore your thought process) as you move from tables to outputs. CTEs can also provide recursive query patterns and cleaner code reuse, but by far the most useful aspect is simple readability. As a rule of thumb: always use a CTE over a subquery, and if you need to use a subquery, make sure it reads cleanly on one line.

WITH load_range AS ( SELECT MIN(event_at) AS start_at, MAX(event_at) AS end_at FROM raw_data.events_load ) SELECT *, (SELECT end_at FROM last_load) AS _elt_max_updated_at FROM raw_data.production_orders AS o LEFT JOIN raw_data.events_load AS e ON o.order_id = e.order_id WHERE o.created_at BETWEEN (SELECT start_at FROM load_range) AND (SELECT end_at FROM load_range)

Making better JOINs

JOINs are a quintessential SQL tool. However, there are two often overlooked JOIN properties that go beyond the basic INNER, LEFT AND FULL OUTER JOINS. They are the NATURAL JOIN and the USING keyword. Most of us get our introduction to JOINS using ON and often spend time working with production DBs that boast an “id” column in every table. For these reasons both USING and NATURAL fall by the wayside until we get deep into our clean Kimball modelled world of facts and dims (dimensions). Both concepts depend on the idea that the joining tables share columns of the identical name.

For example given the tables

  • dims.users
  • User_id
  • User_name
  • facts.orders
  • Date,
  • User_id,
  • Order_id,
  • Revenue

The following are all equivalent

FROM facts.orders AS o INNER JOIN dims.users AS u ON u.user_id = o.user_id

FROM facts.orders AS o INNER JOIN dims.users AS u USING(user_id)

FROM facts.orders AS o NATURAL INNER JOIN dims.users AS u

From top to bottom, each successive method surrenders more of the SQL writer’s control to the parsing engine. Although I prefer the explicit nature of the ON join (especially in SQL transformation code) both USING and NATURAL are worth knowing. A nice use case is when two Fact tables are to be rolled up to a shared granularity before being joined.

WITH session_data AS ( SELECT DATE_TRUNC(‘week’,date) AS week, Age_bucket, SUM(sessions) AS sessions, SUM(revenue) AS revenue FROM facts.sessions ), Cost_data AS ( SELECT Week, SUM(cost) AS cost FROM facts.marketing_spend ) SELECT s.week, s.age_bucket, s.sessions, s.revenue, C.cost FROM session_data NATURAL LEFT JOIN cost_data

Conclusion

SELECT statements, aliases, common table expressions, and JOINs are all beginner concepts, yet feature many nuances that aren’t necessarily evident to casual users of SQL. Take these tips and tricks to heart and your SQL game will improve at a very fundamental level.

Topics
No items found.
Share

Related blog posts

No items found.
No items found.
No items found.

Start for free

Join the thousands of companies using Fivetran to centralize and transform their data.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.