Given the increase in popularity of technologies such as Snowflake, SQL is more prevalent than ever. That’s excellent news; SQL makes data manipulation a lot easier. Still, a lot of practitioners seem to perceive it as an outdated language. This blog post motivates SQL as the language for data science and provides some practical tips for using it effectively.
Why use SQL over Python/R?
SQL is a declarative language. Instead of writing a recipe with exact steps (like we do in imperative languages such as Python or R), we describe what we want to accomplish. The underlying SQL engine decides how to achieve it.
This decoupling of the what from the how has a lot of benefits for us. We can focus on writing human-readable code and let the engine figure out how to compute the result and manage resources. In contrast, when writing Python/R, we have to carefully craft our code to ensure it doesn’t run out of memory.
Another huge benefit is that more people know SQL than Python/R. As data scientists, we have to work with business stakeholders, who are more likely to understand SQL; being able to sit down with someone who knows the business logic and have them help us write the query to pull the appropriate data is priceless.
How much SQL and how much Python/R?
While SQL offers a scalable and simple solution for data manipulation, we often have to use a second language (e.g., for making charts or training a model). However, I’ve encountered cases where data scientists dump an entire SQL table and manipulate the data using Python/R. A more scalable approach is to leverage the SQL engine as much as possible and only move to Python/R once our dataset is down to a reasonable size.
I’ve done ML projects where everything but the training step is SQL: pulling the data, cleaning, and feature engineering. Using SQL allows me to create lightning-fast, scalable pipelines that process TBs of data without worrying about memory management.
A typical SQL query
I often have to use other people’s SQL queries as a starting point for my projects; most of the time, I get something like this:
SELECT a.id, b.*
FROM users a, (SELECT id, COUNT(*) n_interactions FROM interactions GROUP BY id) b
WHERE a.id = b.id
AND a.created >= '2021-01-01'
Understanding this query requires more effort than it should:
- WHERE is joining and filtering data
- I don’t know what the output columns are (due to the b.*)
- The aliases (a and b) require me to move back and forth between the column selection and the alias declaration
- There is no explanation of the a.created >= '2021-01-01' filtering condition
The previous snippet is the kind of code we want to avoid. Let’s now see how we can make the same query more readable.
Clean SQL
The first piece of advice is to organize our query such that every part does one thing. We can achieve this using common table expressions (aka CTEs):
WITH new_users AS (
SELECT id
FROM users
WHERE created >= '2021-01-01'
),
count_interactions AS (
SELECT id,
COUNT(*) n_interactions
FROM interactions
GROUP BY id
),
interactions_by_new_users AS (
SELECT id,
n_interactions
FROM new_users
LEFT JOIN count_interactions USING (id)
)
SELECT *
FROM interactions_by_new_users
With one glance, we can see that there are three independent parts:
- Getting new users
- Counting interactions
- Getting interactions of new users
Since each CTE has a descriptive name, we can quickly understand what this code does without even looking at the details. There isn’t any reason not to use CTEs; they make our code more readable, and usually do not incur any performance overhead.
Documenting your SQL
Note that the previous script doesn’t contain any comments. Keeping documentation up-to-date is hard; I recommend self-documenting your code with meaningful CTE names. However, it’s good practice to have a comment at the top to provide a high-level description.
/*
Counts the number of interactions of new users
*/
/*
Actual SQL code here...
*/
Avoid using comments that do not provide any extra information. A good use for comments is to document business logic:
SELECT id
FROM users
-- the business unit required us to run the analysis for >=2021
WHERE created >= '2021-01-01'
The previous comment gives information that otherwise would be impossible to know. A good heuristic to know where to add comments to a query is to ask yourself: If I read this query 3 months from now, what parts would be the most surprising?
Auto-formatting
If you use Python, you may have heard of formatters like black or yapf. The premise of formatting is that the intelligent use of whitespace improves code readability. Take a look at the following query:
WITH new_users AS (
SELECT id FROM users WHERE created >= '2021-01-01'
), count_interactions AS (
SELECT id, COUNT(*) n_interactions FROM interactions GROUP BY id
), interactions_by_new_users AS (
SELECT id, n_interactions FROM new_users LEFT JOIN count_interactions USING (id)
)
SELECT * FROM interactions_by_new_users
Now take a look at the same query after we auto-formatted it:
WITH new_users AS (
SELECT id
FROM users
WHERE created >= '2021-01-01'
),
count_interactions AS (
SELECT id,
COUNT(*) n_interactions
FROM interactions
GROUP BY id
),
interactions_by_new_users AS (
SELECT id,
n_interactions
FROM new_users
LEFT JOIN count_interactions USING (id)
)
SELECT *
FROM interactions_by_new_users
Given how simple auto-formatters are to use, there is no reason not to adopt one. There are plug-ins available for all popular text editors and IDEs. If you use VS Code, check out these options.
Rules for writing clean SQL scripts
Here’s a summary of my rules for writing SQL:
- Break down logic in CTEs using WITH ... AS
- The SELECT statement inside each CTE must do a single thing (join, filter, or aggregate)
- The CTE name should provide a high-level explanation
- The last statement should be a SELECT statement querying the last CTE
- Use WHERE for filtering, not for joining
- Favor LEFT JOIN over INNER JOIN; in most cases, it’s essential to know the distribution of NULLs
- When doing equijoins (i.e., joins where all conditions have the something = another form), use the USING keyword
- Use aliases only when table names are long enough that using them improves readability (but choose meaningful aliases)
- Do not use SELECT *. Explicitly list columns instead
- Use comments to document business logic
- A comment at the top should provide a high-level description
- Use an auto-formatter
Debugging
Debugging SQL is notoriously hard. Unlike Python/R, where we can start a debugging session at any given line, execute statements step-by-step, and review partial results, we cannot do so with SQL. CTEs come in handy for debugging large queries: since we already broke the logic down into small steps, we can point the last SELECT at any CTE to preview partial results:
WITH new_users AS (
...
),
count_interactions AS (
...
),
interactions_by_new_users AS (
...
)
SELECT *
-- replace with any of the CTE names to see partial results!
FROM interactions_by_new_users
To speed up debugging, you can modify the CTEs that query raw tables/views. But make sure you do so in a consistent way:
WITH new_users AS (
SELECT id
FROM users
WHERE created >= '2021-01-01'
-- subset by getting users with ids 1, 2 and 3
AND id IN (1, 2, 3)
),
count_interactions AS (
SELECT id,
COUNT(*) n_interactions
FROM interactions
-- ensure we subset the same ids
WHERE id IN (1, 2, 3)
GROUP BY id
),
interactions_by_new_users AS (
SELECT id,
n_interactions
FROM new_users
LEFT JOIN count_interactions USING (id)
)
SELECT *
FROM interactions_by_new_users
As a final debugging recommendation: pay special attention to JOINs; in my experience, this is where most errors are.
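For example, one quick sanity check is to replace the final SELECT of the query above with a row-count comparison; since count_interactions has one row per id, the LEFT JOIN should not change the number of rows (a minimal sketch):
SELECT
    (SELECT COUNT(*) FROM new_users) AS n_new_users,
    (SELECT COUNT(*) FROM interactions_by_new_users) AS n_after_join
-- if the two counts differ, something is off with the join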
Templating with jinja
One of the main limitations of SQL is the lack of code abstractions. When writing Python/R, we can write functions to encapsulate and re-use logic. While some SQL engines support some notion of abstractions, there isn’t a standard implementation, and to be honest, we don’t have time to learn the details of each SQL engine we use.
A portable approach is to use a templating language such as jinja. Templating languages help web developers write HTML on the fly, and more recently, they’ve been adopted to do the same with SQL.
Suppose you’re working on a team project and must ensure that everyone is analyzing the same sub-population. It’s hardly ever the case that you’ll get the data extraction logic right on the first try; most likely, you’ll go through several iterations until you nail down every condition you need.
To ensure everyone uses the same query, you may define a jinja macro like this:
-- macros.sql
{% macro select_population() -%}
SELECT column, another
FROM users
WHERE signup_date >= '2018-01-01'
-- business logic continues...
AND ...
{%- endmacro %}
Subsequent queries can embed the previous macro like this:
{% import 'macros.sql' as macros %}
WITH population AS (
-- this will be replaced by the actual SQL query
{{ macros.select_population() }}
), interactions_per_user AS (
...
)
Upon rendering, jinja will replace {{ macros.select_population() }} with the corresponding SELECT statement.
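For reference, this is roughly how the rendering step might look with the jinja2 Python package (a sketch; analysis.sql is a hypothetical file holding the query above, and the loader is what makes the import of macros.sql work):
from jinja2 import Environment, FileSystemLoader
# the loader tells jinja where to find macros.sql when processing the import
env = Environment(loader=FileSystemLoader('.'))
# render the query; the macro call is replaced by the SELECT statement
sql = env.get_template('analysis.sql').render()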
This is one example of how templating can help you create more maintainable SQL codebases; however, do not overdo this. Embedding too much jinja logic makes queries unreadable, and you may end up struggling to get jinja to produce valid SQL code. For more information, check out jinja’s documentation.
Building SQL pipelines
As you explore the data, it’s common to revisit old queries to tweak them. To avoid re-executing every script whenever you make a small change, you can break down your analysis into multiple files, each one generating a table:
# script to create the first table
users-get.sql
# script to create the second table (uses first table as input)
users-aggregate.sql
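As a sketch of what these two files might contain (the new_users and users_per_day names are hypothetical; the columns follow the earlier examples):
-- users-get.sql: materializes the first table
CREATE TABLE new_users AS
SELECT id,
       created
FROM users
WHERE created >= '2021-01-01';
-- users-aggregate.sql: reads the table created by the previous script
CREATE TABLE users_per_day AS
SELECT created,
       COUNT(*) AS n_users
FROM new_users
GROUP BY created;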
Say you are working on a new project and are still in the data exploration phase. If you make a few changes to users-aggregate.sql, you can skip running users-get.sql since it will generate the same output anyway. As your pipeline grows, this approach can save you a lot of time, but it becomes impractical to manually track script dependencies once you have more than a few tasks.
To keep a fast and iterative workflow, you can use Ploomber, which keeps track of past executions and figures out what scripts to skip (Click here to see an example).
Testing data pipelines
Two factors can break your analysis:
- Source data changes (e.g., schema change)
- Incorrect code (e.g., joining on the wrong column)
The strategy to protect against those scenarios is different. In the first case, you want to ensure that your pipeline only runs when input data has some expected properties. Otherwise, your code may crash or execute but yield incorrect results. The second scenario implies that your code is not doing what you are expecting it to do.
Integration testing: Protecting against source data changes
Say you want to compute the average interactions per user in the last month. Your user_interactions table might look like this:
| user_id | interaction_date |
|---|---|
| 1 | 2021-01-03 |
| 1 | 2021-01-10 |
| 2 | 2021-01-02 |
You create a first script that counts interactions per user:
CREATE TABLE interactions_per_user_add_inactive AS (
WITH user_interactions_last_month AS (
SELECT *
FROM user_interactions
WHERE interaction_date >= '2021-01-01'
),
interactions_per_user AS (
SELECT user_id,
count(*) AS n_interactions
FROM user_interactions_last_month
GROUP BY user_id
),
interactions_per_user_add_inactive AS (
-- add users that had 0 interactions
SELECT user_id,
n_interactions
FROM users
LEFT JOIN interactions_per_user USING (user_id)
)
SELECT *
FROM interactions_per_user_add_inactive
)
Then, you compute the average interactions per user with:
SELECT AVG(n_interactions)
FROM interactions_per_user_add_inactive
There’s an error in this code. Can you spot it?
Note the following excerpt from the previous script:
interactions_per_user_add_inactive AS (
-- add users that had 0 interactions
SELECT user_id,
n_interactions
FROM users
LEFT JOIN interactions_per_user USING (user_id)
)
Even if some users didn’t have interactions, we need to count them; that’s the purpose of using a LEFT JOIN. However, such a query yields NULL for users with no interactions, when what we need is zeros. These kinds of silent bugs happen often and can go unnoticed for a long time.
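For reference, one way to get zeros instead of NULLs is to wrap the column with the standard COALESCE function (a minimal sketch of the modified CTE):
interactions_per_user_add_inactive AS (
    -- add users that had 0 interactions, defaulting missing counts to zero
    SELECT user_id,
           COALESCE(n_interactions, 0) AS n_interactions
    FROM users
    LEFT JOIN interactions_per_user USING (user_id)
)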
The idea of integration testing is to ensure that a given input has certain properties so that our code can process it correctly. In our case, before computing the average interactions per user, we want to check that there are no NULLs in that column. Integration tests act as checkpoints that verify the quality of our data. This is how a bash script that runs integration tests might look:
bash create_table_a.sh
# check properties of table a
bash test_table_a.sh
bash create_table_b.sh
# check properties of table b
bash test_table_b.sh
bash join_table_a_and_b.sh
# check properties of joined table
bash test_join_table.sh
Here are some common properties to test (an example check follows the list):
- Column has no nulls
- Column has unique values (ID column, usually the observation unit)
- Column has this subset of values (categorical data)
- Column falls within this range (numerical data)
- Date bounds
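For instance, a check for the first property on the table from the earlier example can be as simple as this (a sketch; fail the pipeline if the count is greater than zero):
SELECT COUNT(*) AS n_nulls
FROM interactions_per_user_add_inactive
WHERE n_interactions IS NULL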
Ploomber offers support for running a function upon task execution. You can include integration tests in a SQL pipeline like this:
# pipeline.yaml
tasks:
  - source: create_table_a.sql
    product: [table_a, table]
    # run the table_a function from tests.py after executing create_table_a.sql
    on_finish: tests.table_a
Beyond integration testing: monitoring
Integration tests are a simple way of testing your data assumptions, but they’re limited in scope. Imagine the average interactions per user looks like this:
| month | avg_interactions_user |
|---|---|
| January | 3 |
| February | 6 |
| March | 10 |
| April | 100 |
That sudden increase in April looks suspicious. While you could write an integration test that checks for sharp increases like this, it would require a lot of hand-tuning. Data monitoring is like adaptive integration testing: the validation logic incorporates historical data to flag issues.
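To give a rough idea, such a check might compare the latest value against the trailing average (a sketch; monthly_interactions is a hypothetical table with a month date column and the avg_interactions_user column, and the 3x threshold is arbitrary):
WITH history AS (
    SELECT AVG(avg_interactions_user) AS historical_avg
    FROM monthly_interactions
    WHERE month < (SELECT MAX(month) FROM monthly_interactions)
),
latest AS (
    SELECT avg_interactions_user AS latest_value
    FROM monthly_interactions
    WHERE month = (SELECT MAX(month) FROM monthly_interactions)
)
SELECT latest_value > 3 * historical_avg AS is_suspicious
FROM latest
CROSS JOIN history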
My recommendation is to start doing integration testing and slowly integrate data monitoring. If you’re interested in knowing how this works, check out this post by Uber Engineering.
Unit testing: Protecting against incorrect code
When unit testing, we pass a pre-canned input and evaluate whether the output is what we expect. For example, we may assess a sign_up_user(email='hello@example.com') function by checking if it returns a success message. We can apply the same reasoning to test SQL scripts; however, setup is considerably more complex, so most SQL goes untested.
If your pipeline only involves filtering, joining, and aggregation, integration testing goes a long way. On the other hand, if you’re building a Machine Learning pipeline and are writing feature engineering in SQL, your queries will likely go well beyond a GROUP BY. For example, you might be aggregating data over different time windows and comparing them to each other (e.g., the increase of user activity from the previous week vs. the current week). In such cases, it’s crucial to consider unit testing. Here are a few tips:
Getting test data
The first thing you have to do is get some testing data. The easiest approach is to query your warehouse for some records, but you have to ensure that you always fetch the same data to keep your tests deterministic (i.e., filter by the primary key).
Alternatively, you can generate some mock data. Using Python, you can easily do this with the pandas library: create a data frame with simulated data and then export it using the .to_sql method.
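A minimal sketch of the pandas approach (the table and values mirror the user_interactions example above; writing to a local SQLite file is an assumption):
import sqlite3
import pandas as pd
# simulated interactions for a couple of users
mock = pd.DataFrame({
    'user_id': [1, 1, 2],
    'interaction_date': ['2021-01-03', '2021-01-10', '2021-01-02'],
})
# pandas can write to a SQLite connection (or any SQLAlchemy engine)
conn = sqlite3.connect('test.db')
mock.to_sql('user_interactions', conn, index=False, if_exists='replace')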
Parametrizing queries
To prevent overwriting your real tables, you can parametrize your SQL files and switch table names when running tests:
CREATE TABLE {{table_to_create}} AS (
SELECT * FROM {{table_to_read}}
-- sql continues...
)
Then in your testing function:
from pathlib import Path
from jinja2 import Template

def test_sql_script():
    # store synthetic data in a table named "synthetic-data"
    # read template...
    template = Template(Path('my-script.sql').read_text())
    sql = template.render(table_to_read='synthetic-data',
                          table_to_create='test_table')
    # execute the query...
    # read results...
    # check if output is what you expect (you define this logic)
    assert created_table == expected_result
Mocking the data warehouse
Making your tests independent of the data warehouse can help you make them faster and simplify setup (e.g., continuous integration). Using SQLite is an excellent choice for unit testing if you’re not using any engine-specific features in your SQL queries. This way, your tests won’t make any actual calls to the data warehouse.
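As a sketch, the test above could run against an in-memory SQLite database instead of the warehouse (assuming the query only uses SQLite-compatible syntax; synthetic_data is an underscore variant of the table name so it remains a valid unquoted identifier):
from pathlib import Path
import sqlite3
from jinja2 import Template
# an in-memory database stands in for the warehouse
conn = sqlite3.connect(':memory:')
# ... load synthetic data into a synthetic_data table here (e.g., with .to_sql)
sql = Template(Path('my-script.sql').read_text()).render(
    table_to_read='synthetic_data',
    table_to_create='test_table')
# run the rendered script against SQLite
conn.executescript(sql)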
Concluding thoughts
SQL is a fantastic language. I’ve seen data scientists jump straight into distributed computing tools such as Spark without even considering using the data warehouse as a computing tool. While SQL comes with its own set of challenges, the benefits far outweigh them: it’s easier to learn SQL than to maintain a Spark cluster.
I hope you found this post informative and valuable for your upcoming projects. Of course, nothing is set in stone; if you have any recommendations for improving SQL workflows please reach out!