
JupySQL: Better SQL in Jupyter


TL;DR: we forked ipython-sql (pip install jupysql) and are actively developing it to bring a modern SQL experience to Jupyter! We’ve already built some great features, such as SQL query composition and plotting for large-scale datasets!

A few months after I started my career in Data Science, I encountered the ipython-sql package (which enables you to write cells like %sql SELECT * FROM table), and I was amazed by it. It allowed me to write SQL cells, making my notebooks much cleaner since I no longer needed to wrap each SQL query into a Python call. Furthermore, it also led me to learn that Jupyter (known as IPython back then) runs a modified Python interpreter, which enables all kinds of fun stuff like “magics” (e.g., %%bash, %%capture, etc.).

Fast forward to today: the Ploomber team has been busy helping dozens of companies use notebooks more effectively. From notebook profiling and unit testing to experiment tracking and notebook deployment, we’ve seen hundreds of use cases. One of the most recurrent is SQL-heavy notebooks used for scheduled reporting: a data practitioner writes some SQL in a notebook that produces a few plots, and the notebook is then turned into a scheduled job that delivers results daily or weekly.

Many of the notebooks we’ve seen use pandas.read_sql (or a variant of it) to run SQL queries. Whenever we saw this, we’d recommend switching to ipython-sql to simplify the notebook. However, in some cases ipython-sql’s limitations meant notebooks ended up with a mix of %%sql and pandas.read_sql calls. Since we love the %%sql abstraction, we decided to fork the repository and speed up development to bring the best of SQL to the notebook world! We’re thankful to the original maintainers and contributors for such a great package.
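As a rough sketch of the difference (the table name and connection string below are just placeholders): with pandas.read_sql, every query lives inside a Python string:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("duckdb://")  # placeholder connection string
df = pd.read_sql("SELECT * FROM my_table LIMIT 3", engine)

With the SQL magic, the cell body is just SQL:

%%sql
SELECT * FROM my_table LIMIT 3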

We’ve released several new features already: better connection management, large SQL query composition, and plotting for large-scale datasets, among others. So let’s see what’s new!

Setup

We wrote this blog post in Jupyter (thanks to our Jupyblog package), so let’s start by installing and configuring a few things. We’ll be using DuckDB and the penguins dataset for the examples.

%pip install jupysql duckdb-engine --quiet
%load_ext sql
%config SqlMagic.displaylimit = 5
%sql duckdb://

Console output (1/1):

Note: you may need to restart the kernel to use updated packages.
from pathlib import Path
from urllib.request import urlretrieve

if not Path("penguins.csv").is_file():
    urlretrieve("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
                "penguins.csv")

Great, we’re ready to start querying our data.

Querying data

JupySQL is fully compatible with notebooks authored using ipython-sql; all arguments and configuration settings remain the same.
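For example, if you were already tweaking ipython-sql’s configuration, those settings should carry over unchanged; a minimal sketch (pick whichever options you actually use):

%config SqlMagic.autopandas = True   # return query results as pandas DataFrames
%config SqlMagic.feedback = False    # hide the "Done." messages
%config SqlMagic.displaycon = False  # hide the connection listing above each result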

There is the usual line magic:

%sql SELECT * FROM penguins.csv LIMIT 3

Console output (1/2):

*  duckdb://
Done.

Console output (2/2):

species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
Adelie   Torgersen  39.1            18.7            181                3750         MALE
Adelie   Torgersen  39.5            17.4            186                3800         FEMALE
Adelie   Torgersen  40.3            18.0            195                3250         FEMALE

And a cell magic:

%%sql
SELECT *
FROM penguins.csv
LIMIT 3

Console output (1/2):

*  duckdb://
Done.

Console output (2/2):

species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
Adelie   Torgersen  39.1            18.7            181                3750         MALE
Adelie   Torgersen  39.5            17.4            186                3800         FEMALE
Adelie   Torgersen  40.3            18.0            195                3250         FEMALE

Switching connections

We improved support for managing multiple connections via a new --alias argument. Before that, users relied on the connection string to switch and close connections. Let’s create a new connection to an in-memory SQLite database:

%sql sqlite:// --alias second-db

We see the new connection is listed:

%sql

Console output (1/1):

duckdb://
*  (second-db) sqlite://

And we can close it using the alias:

%sql --close second-db
%sql

Console output (1/1):

duckdb://
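
Note that you can switch the active connection at any time by passing the connection string of an already open connection (the next section’s first cell does exactly this with duckdb://); a minimal sketch:

%sql duckdb://   # make the DuckDB connection the active one again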

Composing large SQL queries

A recurrent pattern we’ve seen is notebooks that use Jinja/templates for writing large queries (if you’re a dbt user, you know what I’m talking about). In JupySQL, you can write short queries, and we’ll automatically convert them into CTEs; let’s see an example and define a query that subsets our data:

%%sql duckdb:// --save adelie
SELECT *
FROM penguins.csv
WHERE species = 'Adelie'

Console output (1/2):

Done.

Console output (2/2):

species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
Adelie   Torgersen  39.1            18.7            181                3750         MALE
Adelie   Torgersen  39.5            17.4            186                3800         FEMALE
Adelie   Torgersen  40.3            18.0            195                3250         FEMALE
Adelie   Torgersen  None            None            None               None         None
Adelie   Torgersen  36.7            19.3            193                3450         FEMALE
152 rows, truncated to displaylimit of 5

Now, let’s say that we want to build on top of the previous data subset; we could manually copy our last query and embed it in a new cell with a CTE, something like this:

WITH adelie AS (
    SELECT *
    FROM penguins.csv
    WHERE species = 'Adelie'
)

{your query here}

However, this approach has a couple of problems:

  • We’re duplicating logic since the same query exists multiple times
  • We cannot preview each part of the CTE, which makes debugging a lot harder

To fix this, JupySQL brings a new --with argument that will automatically build the CTE for us:

%%sql --with adelie
SELECT *
FROM adelie
WHERE island = 'Torgersen'

Console output (1/2):

*  duckdb://
Done.

Console output (2/2):

species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
Adelie   Torgersen  39.1            18.7            181                3750         MALE
Adelie   Torgersen  39.5            17.4            186                3800         FEMALE
Adelie   Torgersen  40.3            18.0            195                3250         FEMALE
Adelie   Torgersen  None            None            None               None         None
Adelie   Torgersen  36.7            19.3            193                3450         FEMALE
52 rows, truncated to displaylimit of 5

In the background, JupySQL runs the following:

WITH adelie AS (
    SELECT *
    FROM penguins.csv
    WHERE species = 'Adelie'
)

SELECT *
FROM adelie
WHERE island = 'Torgersen'
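
If you want to keep working with a query result in Python, the << assignment syntax inherited from ipython-sql can capture it into a variable; a minimal sketch (the variable name torgersen is just an example):

%%sql torgersen <<
SELECT *
FROM penguins.csv
WHERE species = 'Adelie' AND island = 'Torgersen'

and then, in a regular Python cell:

df = torgersen.DataFrame()  # convert the result set into a pandas DataFrame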

Plotting large-scale datasets with %sqlplot

SQL notebooks almost always contain visualizations that allow data practitioners to understand their data. However, we saw two significant limitations when creating plots:

  1. When plotting data stored in a data warehouse, many notebooks fetched entire columns (or most of them) since matplotlib (and seaborn) require all data to exist locally for plotting. This approach made notebooks extremely inefficient and incurred high data transfer costs.
  2. Practitioners would often run out of memory, since loading big datasets for plotting with matplotlib accumulates intermediate results in memory.

We created a new %sqlplot magic to fix these issues. %sqlplot performs most transformations in the SQL engine and only transfers the summary statistics required for the plot, drastically reducing data transfer and allowing practitioners to leverage the highly optimized SQL engines that data warehouses provide. Furthermore, it enables them to plot larger-than-memory .csv and .parquet files via DuckDB. Let’s see an example.

First, let’s store a query that filters out NULLs:

%%sql --save not-nulls --no-execute
SELECT *
FROM penguins.csv
WHERE bill_length_mm IS NOT NULL
AND bill_depth_mm IS NOT NULL

Console output (1/1):

*  duckdb://
Skipping execution...

Now, we create a boxplot. A boxplot requires computing percentiles and a few other statistics. Those are all computed in the SQL engine (in this case, DuckDB). Then we use those statistics to create the plot:

%sqlplot boxplot --column bill_length_mm bill_depth_mm --table not-nulls --with not-nulls

Console output (1/1):

[Boxplot of bill_length_mm and bill_depth_mm]

We can also create a histogram. All the data is aggregated in DuckDB (or the data warehouse), and only the bins and counts are retrieved:

%sqlplot histogram --column bill_length_mm bill_depth_mm --table not-nulls --with not-nulls

Console output (1/1):

[Histogram of bill_length_mm and bill_depth_mm]
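
Both plots accept extra arguments; for instance, the histogram’s bin count can be tuned. A sketch (we’re assuming the --bins option here, reusing the not-nulls query saved above):

%sqlplot histogram --column bill_length_mm --table not-nulls --with not-nulls --bins 50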

Custom engines

Often, users want to customize database connections to bring extra features. ipython-sql only allows establishing connections via a connection string, which severely limits such customization. In JupySQL, you can pass an existing connection (e.g., a SQLAlchemy engine). Let me show you two example use cases.

Use case: querying in-memory data frames

If using DuckDB, you can create a connection, register existing data frames, pass the customized engine, and query them with SQL:

import pandas as pd
from sqlalchemy import create_engine

# create an in-memory DuckDB engine and register a data frame as a view named "df"
engine = create_engine("duckdb:///:memory:")
engine.execute("register", ("df", pd.DataFrame({"x": range(100)})))

Console output (1/1):

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x11ddbd9a0>
%sql engine
%%sql
SELECT *
FROM df
WHERE x > 95

Console output (1/2):

duckdb://
*  duckdb:///:memory:
Done.

Console output (2/2):

x
96
97
98
99

Use case: registering SQLite UDFs

If using SQLite, passing a custom connection allows you to register user-defined functions (UDFs):

from sqlalchemy import create_engine
from sqlalchemy import event

def mysum(x, y):
    return x + y

engine_sqlite = create_engine("sqlite://")

@event.listens_for(engine_sqlite, "connect")
def connect(conn, rec):
    conn.create_function(name="MYSUM", narg=2, func=mysum)
%sql engine_sqlite
%%sql
SELECT MYSUM(1, 2)

Console output (1/2):

duckdb://
   duckdb:///:memory:
*  sqlite://
Done.

Console output (2/2):

MYSUM(1, 2)
3

Interactive documentation

We’ve completely revamped the documentation using Jupyter Book, following a structured documentation system. You can find a Quick Start section, a detailed User Guide, and the technical API reference. Furthermore, all our tutorials contain a 🚀 button that starts a temporary JupyterLab session so you can try them out!

The future

We’re stoked to be working on JupySQL. Its user base has been growing rapidly, cementing our conviction that the data community wants a better SQL experience in Jupyter. There are a few things we’re working on that we’ll release soon, so stay tuned (we’re on Twitter and LinkedIn). If you want to share your feedback, feel free to open an issue on GitHub or send us a message on Slack!
