
SQL and Python Best Practices and Tips
Contents Introduction Securely storing and accessing credentials When to use Jupyter and JupySQL Magics When to use IDEs …
This tutorial and complementary scripts were co-authored with Tuana Çelik, Developer Advocate at Haystack.
In this tutorial, we will see how we can use JupySQL and Haystack together to create a Jupyter Notebook tool which allows you to query local SQL databases in natural language.
We will create a custom Haystack Component and an Agent that will allow us to ask a query in human language and get a response based on this dataset.
In the end, we will have an Agent that is able to:
SELECT COUNT(*) FROM absenteeism;
To learn more about how to create and use an Agent, check out the Haystack Answering Multihop Questions with Agents Tutorial.
Let’s get started!
Haystack is an open-source framework for building search systems that work intelligently over large document collections. Recent advances in NLP have enabled the application of question answering, retrieval and summarization to real world settings and Haystack is designed to be the bridge between research and industry. Read more to learn about the types of Natural Language tasks you can solve with Haystack here.
Image source: https://haystack.deepset.ai/overview/intro
There are four levels on which you can interact with the components in Haystack:
In this blog, we will explore how we can combine the elements in Haystack to the elements in JupySQL to create a custom agent for answering questions about data from Jupyter notebooks.
We’ll be working with the Absenteeism at work dataset from the UCI Machine Learning Repository. This dataset provides diverse combinations of attributes, facilitating a broad range of research topics. It includes data on absenteeism categorized by factors such as the individual’s ID, reason for absence, month of absence, day of the week, seasons, and transportation expense, among others.
You can learn more about the dataset here.
First, let’s install the necessary libraries. You can comment out the installation lines after running them once.
%pip install -q jupysql --upgrade duckdb-engine --quiet
%pip install -q farm-haystack[colab]
The helper script datadownload.py
will extract .zip
files from the UCI repository, perform data cleaning, and save the clean data to a file for us to load to a database. You can find the helper scripts used here
!python datadownload.py
We can now use JupySQL’s magics %
and %%
to load the extension, initialize the database and load the clean data to a DuckDB instance.
%reload_ext sql
%sql duckdb:///absenteeism.duck.db
%%sql
create or replace table absenteeism as
from read_csv_auto('Absenteeism_at_work_cleaned.csv', header=True, sep=';')
Haystack supports the creation of custom agents. Below we share an example.
The JupySQL Agent leverages an AI model to convert your natural language queries into SQL queries, fetches data based on these SQL queries from your database, and then responds in natural language.
To start querying your SQL database, you simply input a question in human language.
For instance, you might ask, “How many entries are there in total?” The JupySQL Agent translates this into an SQL query like
SELECT COUNT(*) FROM absenteeism;
and returns the answer in a natural language format.
from haystack.nodes.base import BaseComponent
class JupySQLQuery(BaseComponent):
outgoing_edges = 1
def __init__(self):
%reload_ext sql
def run(self, query: str):
result = %sql {{query}}
output = {
"results": f"{result}",
"query": query,
}
return output
def run_batch(self, queries: list):
results = []
for query in queries:
result = %sql {query}
output = {
"results": f"{result}",
"query": query,
}
results.append(output)
return results
jupy_sql_query = JupySQLQuery()
The JupySQLQuery
is a subclass of the BaseComponent
class in Haystack.
BaseComponent
is an abstract base class provided by Haystack for creating custom pipeline components.
This class provides an interface for all pipeline components and includes methods and properties necessary to function within a Haystack pipeline.
In this way, JupySQLQuery
is a custom class that extends the BaseComponent
class and is used to execute SQL queries in a Haystack pipeline within Jupyter notebooks.
When the run()
method is invoked, it executes the given query, retrieves the result, and returns a dictionary with the results and the query.
When the run_batch()
method is invoked, it will take in a list of SQL queries and return a list of results. Each result is a dictionary containing the original query and its corresponding result.
You will need to provide an API Key for GPT-4, the language model powering the JupySQL Agent.
The agent uses a specific prompt template to generate SQL queries. It comes pre-loaded with the columns of the absenteeism
database, so you don’t have to worry about memorizing column names.
Create a .env
file with the following structure:
OPENAI-API=<your-openai-api-key>
OPENAI-MODEL=gpt-4
You can change the model to gpt-3
or other based on your preference.
Save the .env
file.
Here’s an example of how to use the agent:
from haystack.agents import Tool
from jupysqlagent import agent
jupy_sql_query_tool = Tool(name="JupySQL_Query",
pipeline_or_node=jupy_sql_query,
description="""This tool is useful for consuming SQL queries \
and responds with the result""")
agent.add_tool(jupy_sql_query_tool)
Asking the agent to answer a single question.
result = agent.run("Find the top 3 ages with the highest total absenteeism hours, excluding disciplinary failures. ")
Asking the agent to answer multiple questions.
agent.run("Find an answer to the following list of questions: \n\n\
1. What is the average age of people who are social drinkers? \n\
2. What is the average age of people who are social smokers? \n\
3. What is the average age of people who are social drinkers and social smokers? ")
The agent also comes equipped with various tools, including the SQLQuery tool, which it uses to run the generated SQL queries and fetch data from your database.
By utilizing large language models and SQL querying capability, the JupySQL Agent makes the task of retrieving information from databases a breeze.
Join our Slack community if you need any further assistance with the JupySQL Agent.
Found an error? Click here to let us know.
Contents Introduction Securely storing and accessing credentials When to use Jupyter and JupySQL Magics When to use IDEs …
Running SQL queries on Jupyter notebook SQL (Structured Query Language) is a programming language designed for managing …