Running SQL queries on Jupyter notebook

SQL (Structured Query Language) is a programming language designed for managing data in a relational database. It has a variety of functions that allow its users to read, manipulate, and update data. SQL is also a great tool for performing aggregations like sums, counts, minimums and maximums, etc., over much larger datasets and on multiple tables at the same time. Data professionals often need to run SQL queries on Jupyter Notebook for various data analysis tasks.

In this tutorial, we’ll start looking at some simpler approaches to running SQL queries on Jupyter Notebooks, and then move on to more complex approaches.

JupySQL

JupySQL is a fork of ipython-sql that allows users to run SQL queries directly from a Jupyter Notebook in a cleaner way. This library eliminates the need to write multiple lines of code to connect to the database, or wrap the query in a string as in the other approaches. This library also comes with a lot of new features like:

  • A plotting module that allows you to efficiently plot massive datasets without running out of memory.

  • Feature to break queries into multiple cells with the help of CTEs.

  • Compatible with all major databases (e.g., PostgreSQL, MySQL, SQL Server), data warehouses (e.g., Snowflake, BigQuery, Redshift), and embedded engines (SQLite, and DuckDB).

Click here to learn more about the features of JupySQL.

Install JupySQL using the below command:

%pip install jupysql --quiet
Note: you may need to restart the kernel to use updated packages.

JupySQL allows you to run SQL in Jupyter/IPython via %sql and %%sql magics.

%load_ext sql
%%sql sqlite://
CREATE TABLE languages (name, rating, change);
INSERT INTO languages VALUES ('Python', 14.44, 2.48);
INSERT INTO languages VALUES ('C', 13.13, 1.50);
INSERT INTO languages VALUES ('Java', 11.59, 0.40);
INSERT INTO languages VALUES ('C++', 10.00, 1.98);
Click to view output Done.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
[]

%sql SELECT * FROM languages
*  sqlite://
Done.
nameratingchange
Python14.442.48
C13.131.5
Java11.590.4
C++10.01.98

For a more elaborate example using DuckDB click here.

JupySQL also allows users to easily convert query results into Pandas dataframe:

result = %sql SELECT * FROM languages WHERE rating > 10
*  sqlite://
Done.

Pandas

Pandas is a popular open-source library widely used for data analysis. It provides fast, flexible, and expressive data structures designed to make working with relational or labelled data both easy and intuitive. Both Pandas and SQL are important tools used in the data analysis tech stack. Pandas can be very quickly used in simple environments like notebooks. While SQL can be used for querying the data warehouse or a distributed database, Pandas is a better tool for performing complicated data transformation and Exploratory data analysis.

In this section, we’ll learn how to create a table and query the database using the to_sql and read_sql functions provided by Pandas. This will be demonstrated on Seaborn’s tips dataset. First, install the below packages:

%pip install pandas seaborn pip install memory-profiler --quiet
Note: you may need to restart the kernel to use updated packages.

Now let’s use the to_sql function for saving the dataset in the database as a table.

import seaborn as sb

df = sb.load_dataset('tips')
conn = "sqlite:///example"
df.to_sql(
            'tips',            
            conn,                
            if_exists='replace'
           )
244

Let’s use the read_sql function to filter out some columns from this table.

import pandas as pd
query_result = pd.read_sql("""
            SELECT total_bill, tip, sex
            FROM tips
            WHERE time='Dinner'
            """, con=conn)
query_result.head()

total_billtipsex
016.991.01Female
110.341.66Male
221.013.50Male
323.683.31Male
424.593.61Female

Running read_sql queries on large datasets can easily lead to out-of-memory issues. This function supports a chunksize argument, however, this argument is only used for creating chunks of dataframes to be returned to the user. By default, it will fetch all data into memory at once before chunking unless the database driver has a feature to support server-side-cursors.

Let’s add the above code to a Python script named read_tips.py and run a memory profiler to see the memory usage:

!python -m memory_profiler read_tips.py
Filename: read_tips.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4  107.641 MiB  107.641 MiB           1   @profile
     5                                         def profile():
     6  108.516 MiB    0.875 MiB           1       df = sb.load_dataset('tips')
     7  108.516 MiB    0.000 MiB           1       conn = "sqlite:///example"
     8  117.969 MiB    9.453 MiB           2       df.to_sql(
     9  108.516 MiB    0.000 MiB           1                   'tips',            
    10  108.516 MiB    0.000 MiB           1                   conn,                
    11  108.516 MiB    0.000 MiB           1                   if_exists='replace'
    12                                                        )
    13                                         
    14  118.000 MiB    0.031 MiB           2       query_result = pd.read_sql("""
    15                                                         SELECT total_bill, tip, sex
    16                                                         FROM tips
    17                                                         WHERE time='Dinner'
    18  117.969 MiB    0.000 MiB           1                   """, con=conn)

Now set the chunksize parameter and check that the memory usage remains the same:

!python -m memory_profiler read_tips_chunksize.py
Filename: read_tips_chunksize.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4  110.109 MiB  110.109 MiB           1   @profile
     5                                         def profile():
     6  111.062 MiB    0.953 MiB           1       df = sb.load_dataset('tips')
     7  111.062 MiB    0.000 MiB           1       conn = "sqlite:///example"
     8  122.609 MiB   11.547 MiB           2       df.to_sql(
     9  111.062 MiB    0.000 MiB           1                   'tips',            
    10  111.062 MiB    0.000 MiB           1                   conn,                
    11  111.062 MiB    0.000 MiB           1                   if_exists='replace'
    12                                                        )
    13                                         
    14  122.625 MiB    0.016 MiB           2       query_result = pd.read_sql("""
    15                                                         SELECT total_bill, tip, sex
    16                                                         FROM tips
    17                                                         WHERE time='Dinner'
    18  122.609 MiB    0.000 MiB           1                   """, con=conn, chunksize=50)

Database drivers

Database drivers like psycopg2 and sqlite3 can also be directly used for running SQL queries from notebooks. Let’s see a simple example using sqlite3:

First, we need to create a new database and call sqlite3.connect() to create a connection to the database tutorial.db in the current working directory.

import sqlite3
con = sqlite3.connect("tutorial.db")

The con represents the connection to the on-disk database. We will also need to use a database cursor to execute SQL statements.

cur = con.cursor()
cur.execute("CREATE TABLE movie(title, year, score)")
cur.execute("""
    INSERT INTO movie VALUES
        ('Monty Python and the Holy Grail', 1975, 8.2),
        ('And Now for Something Completely Different', 1971, 7.5)
""")
con.commit()
res = cur.execute("SELECT score FROM movie")
res.fetchall()
[(8.2,), (7.5,)]

SQLAlchemy

SQLAlchemy is a popular SQL toolkit and Object Relational Mapper (ORM). ORM is a technique that lets application developers query and manipulates data from a database using the object-oriented paradigm. SQLAlchemy translates Python classes to tables on relational databases and allows developers to create database-agnostic code to communicate with various database engines. It also abstracts boilerplate code like handling database connections.

Let’s see an example of using SQLAlchemy to create tables in a database and query records from the tables.

Install SQLAlchemy from the notebook using the below command:

%pip install SQLAlchemy --quiet
Note: you may need to restart the kernel to use updated packages.

Now let’s connect to the SQLite database and create two tables Orders and OrderDetails.

from sqlalchemy import create_engine
from sqlalchemy.orm import relationship
from sqlalchemy.orm import declarative_base
from sqlalchemy import Column, Integer, DateTime, String, ForeignKey

engine = create_engine('sqlite:///example', echo=True)

Base = declarative_base()

class Orders(Base):
    __tablename__ = 'orders'

    id = Column(Integer, primary_key=True, nullable=False)
    customer_id = Column(Integer)
    order_date = Column(DateTime)
    status = Column(String)

class OrderDetails(Base):
    __tablename__ = 'order_details'
    
    id = Column(Integer, primary_key=True, nullable=False)
    product_id = Column(Integer)
    quantity = Column(Integer)
    order_id = Column(Integer, ForeignKey(Orders.id), nullable=False)
    order = relationship("Orders")

Above, the declarative_base callable returns a new base class from which all mapped classes should inherit. The create_engine() function produces an Engine object based on a URL. The Engine, once created, can either be used directly to interact with the database, or can be passed to a Session object to work with the ORM. The relationship directive tells the ORM that the OrderDetails class should be linked to the Orders class through the attribute Orders.id. It uses the foreign key relationships to determine the nature of the linkage, in this case, many-to-one relationship from OrderDetails to Orders.

Now, create all the tables defined in the metadata using the create_all() method and add some records to both tables through a session. Session establishes all conversations with the database and provides the interface where SELECT and other queries are made that will return and modify ORM-mapped objects.

Base.metadata.create_all(engine)
Click to view output
2023-04-20 19:10:59,155 INFO sqlalchemy.engine.Engine BEGIN (implicit)


INFO:sqlalchemy.engine.Engine:BEGIN (implicit)


2023-04-20 19:10:59,157 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("orders")


INFO:sqlalchemy.engine.Engine:PRAGMA main.table_info("orders")


2023-04-20 19:10:59,159 INFO sqlalchemy.engine.Engine [raw sql] ()


INFO:sqlalchemy.engine.Engine:[raw sql] ()


2023-04-20 19:10:59,162 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("orders")


INFO:sqlalchemy.engine.Engine:PRAGMA temp.table_info("orders")


2023-04-20 19:10:59,163 INFO sqlalchemy.engine.Engine [raw sql] ()


INFO:sqlalchemy.engine.Engine:[raw sql] ()


2023-04-20 19:10:59,164 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("order_details")


INFO:sqlalchemy.engine.Engine:PRAGMA main.table_info("order_details")


2023-04-20 19:10:59,165 INFO sqlalchemy.engine.Engine [raw sql] ()


INFO:sqlalchemy.engine.Engine:[raw sql] ()


2023-04-20 19:10:59,167 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("order_details")


INFO:sqlalchemy.engine.Engine:PRAGMA temp.table_info("order_details")


2023-04-20 19:10:59,168 INFO sqlalchemy.engine.Engine [raw sql] ()


INFO:sqlalchemy.engine.Engine:[raw sql] ()


2023-04-20 19:10:59,171 INFO sqlalchemy.engine.Engine 
CREATE TABLE orders (
    id INTEGER NOT NULL, 
    customer_id INTEGER, 
    order_date DATETIME, 
    status VARCHAR, 
    PRIMARY KEY (id)
)




INFO:sqlalchemy.engine.Engine:
CREATE TABLE orders (
    id INTEGER NOT NULL, 
    customer_id INTEGER, 
    order_date DATETIME, 
    status VARCHAR, 
    PRIMARY KEY (id)
)




2023-04-20 19:10:59,172 INFO sqlalchemy.engine.Engine [no key 0.00097s] ()


INFO:sqlalchemy.engine.Engine:[no key 0.00097s] ()


2023-04-20 19:10:59,174 INFO sqlalchemy.engine.Engine 
CREATE TABLE order_details (
    id INTEGER NOT NULL, 
    product_id INTEGER, 
    quantity INTEGER, 
    order_id INTEGER NOT NULL, 
    PRIMARY KEY (id), 
    FOREIGN KEY(order_id) REFERENCES orders (id)
)




INFO:sqlalchemy.engine.Engine:
CREATE TABLE order_details (
    id INTEGER NOT NULL, 
    product_id INTEGER, 
    quantity INTEGER, 
    order_id INTEGER NOT NULL, 
    PRIMARY KEY (id), 
    FOREIGN KEY(order_id) REFERENCES orders (id)
)




2023-04-20 19:10:59,176 INFO sqlalchemy.engine.Engine [no key 0.00117s] ()


INFO:sqlalchemy.engine.Engine:[no key 0.00117s] ()


2023-04-20 19:10:59,178 INFO sqlalchemy.engine.Engine COMMIT


INFO:sqlalchemy.engine.Engine:COMMIT

from datetime import datetime
from sqlalchemy.orm import Session

session = Session(engine)
order = Orders(customer_id=192, order_date=datetime.now(), status='P')
session.add(order)
session.commit()

orderDetails1 = OrderDetails(order=order, product_id=4321, quantity=2)
orderDetails2 = OrderDetails(order=order, product_id=5512, quantity=1)
session.add(orderDetails1)
session.add(orderDetails2)
session.commit()
Click to view output
2023-04-20 19:10:59,695 INFO sqlalchemy.engine.Engine BEGIN (implicit)


INFO:sqlalchemy.engine.Engine:BEGIN (implicit)


2023-04-20 19:10:59,699 INFO sqlalchemy.engine.Engine INSERT INTO orders (customer_id, order_date, status) VALUES (?, ?, ?)


INFO:sqlalchemy.engine.Engine:INSERT INTO orders (customer_id, order_date, status) VALUES (?, ?, ?)


2023-04-20 19:10:59,700 INFO sqlalchemy.engine.Engine [generated in 0.00162s] (192, '2023-04-20 19:10:59.689123', 'P')


INFO:sqlalchemy.engine.Engine:[generated in 0.00162s] (192, '2023-04-20 19:10:59.689123', 'P')


2023-04-20 19:10:59,703 INFO sqlalchemy.engine.Engine COMMIT


INFO:sqlalchemy.engine.Engine:COMMIT


2023-04-20 19:10:59,706 INFO sqlalchemy.engine.Engine BEGIN (implicit)


INFO:sqlalchemy.engine.Engine:BEGIN (implicit)


2023-04-20 19:10:59,710 INFO sqlalchemy.engine.Engine SELECT orders.id AS orders_id, orders.customer_id AS orders_customer_id, orders.order_date AS orders_order_date, orders.status AS orders_status 
FROM orders 
WHERE orders.id = ?


INFO:sqlalchemy.engine.Engine:SELECT orders.id AS orders_id, orders.customer_id AS orders_customer_id, orders.order_date AS orders_order_date, orders.status AS orders_status 
FROM orders 
WHERE orders.id = ?


2023-04-20 19:10:59,711 INFO sqlalchemy.engine.Engine [generated in 0.00102s] (1,)


INFO:sqlalchemy.engine.Engine:[generated in 0.00102s] (1,)


2023-04-20 19:10:59,715 INFO sqlalchemy.engine.Engine INSERT INTO order_details (product_id, quantity, order_id) VALUES (?, ?, ?), (?, ?, ?) RETURNING id


INFO:sqlalchemy.engine.Engine:INSERT INTO order_details (product_id, quantity, order_id) VALUES (?, ?, ?), (?, ?, ?) RETURNING id


2023-04-20 19:10:59,717 INFO sqlalchemy.engine.Engine [generated in 0.00137s (insertmanyvalues)] (4321, 2, 1, 5512, 1, 1)


INFO:sqlalchemy.engine.Engine:[generated in 0.00137s (insertmanyvalues)] (4321, 2, 1, 5512, 1, 1)


2023-04-20 19:10:59,723 INFO sqlalchemy.engine.Engine COMMIT


INFO:sqlalchemy.engine.Engine:COMMIT

orderDetails = session.query(OrderDetails).all()
for detail in orderDetails:
    print(detail.quantity)
Click to view output
2023-04-20 19:11:00,763 INFO sqlalchemy.engine.Engine BEGIN (implicit)


INFO:sqlalchemy.engine.Engine:BEGIN (implicit)


2023-04-20 19:11:00,766 INFO sqlalchemy.engine.Engine SELECT order_details.id AS order_details_id, order_details.product_id AS order_details_product_id, order_details.quantity AS order_details_quantity, order_details.order_id AS order_details_order_id 
FROM order_details


INFO:sqlalchemy.engine.Engine:SELECT order_details.id AS order_details_id, order_details.product_id AS order_details_product_id, order_details.quantity AS order_details_quantity, order_details.order_id AS order_details_order_id 
FROM order_details


2023-04-20 19:11:00,767 INFO sqlalchemy.engine.Engine [generated in 0.00163s] ()


INFO:sqlalchemy.engine.Engine:[generated in 0.00163s] ()


2
1

ORMs like SQLAlchemy help speed up development time by providing a layer of abstraction and allowing the user to focus on core business logic rather than boilerplate code used for CRUD operations. They often ship with a lot of great features like portability across multiple databases. However, they do have some downsides:

  • Developers need to be well-versed in the syntax of SqlAlchemy.

  • ORMs are generally slower than using raw SQL since they act as middleware between the code and the query execution.

  • Sometimes raw SQL is the way to go when writing complex queries that are difficult to be executed through SQLAlchemy.