Query Your Dataset with Natural Language Using JupySQL and Haystack Agents

This tutorial and complementary scripts were co-authored with Tuana Çelik, Developer Advocate at Haystack.

In this tutorial, we will see how we can use JupySQL and Haystack together to create a Jupyter Notebook tool which allows you to query local SQL databases in natural language.

We will create a custom Haystack Component and an Agent that will allow us to ask a query in human language and get a response based on this dataset.

In the end, we will have an Agent that is able to:

  • Consume a query such as: “How many entries are there in total?”
  • Generate the follwing query:SELECT COUNT(*) FROM absenteeism;
  • Reply in natural language

To learn more about how to create and use an Agent, check out the Haystack Answering Multihop Questions with Agents Tutorial.

Let’s get started!

What is Haystack?

Haystack is an open-source framework for building search systems that work intelligently over large document collections. Recent advances in NLP have enabled the application of question answering, retrieval and summarization to real world settings and Haystack is designed to be the bridge between research and industry. Read more to learn about the types of Natural Language tasks you can solve with Haystack here.

Image source: https://haystack.deepset.ai/overview/intro

Key components involved in Haystack

There are four levels on which you can interact with the components in Haystack:

  1. Nodes: they perform different kinds of text processing. These are often powered by the latest transformer models.
  2. Pipelines: while nodes allow the execution of text processing tasks, pipelines are what allow you to connect different nodes together.
  3. Agents: prompt-based component that uses a large language model and employs reasoning to answer complex questions beyond the capabilities of extractive or generative question answering.
  4. REST API: you can deploy your Haystack application as a REST API.

In this blog, we will explore how we can combine the elements in Haystack to the elements in JupySQL to create a custom agent for answering questions about data from Jupyter notebooks.

Dataset: Absenteeism at work

We’ll be working with the Absenteeism at work dataset from the UCI Machine Learning Repository. This dataset provides diverse combinations of attributes, facilitating a broad range of research topics. It includes data on absenteeism categorized by factors such as the individual’s ID, reason for absence, month of absence, day of the week, seasons, and transportation expense, among others.

You can learn more about the dataset here.

Setting up JupySQL and Haystack

First, let’s install the necessary libraries. You can comment out the installation lines after running them once.

%pip install -q jupysql --upgrade duckdb-engine --quiet
%pip install -q farm-haystack[colab]

Download and clean the data with our helper script

The helper script datadownload.py will extract .zip files from the UCI repository, perform data cleaning, and save the clean data to a file for us to load to a database. You can find the helper scripts used here

!python datadownload.py

Load the engine and setup a DuckDB instance

We can now use JupySQL’s magics % and %% to load the extension, initialize the database and load the clean data to a DuckDB instance.

%reload_ext sql
%sql duckdb:///absenteeism.duck.db
%%sql
create or replace table absenteeism as
from read_csv_auto('Absenteeism_at_work_cleaned.csv', header=True, sep=';')

Exploring the JupySQL Agent

Haystack supports the creation of custom agents. Below we share an example.

The JupySQL Agent leverages an AI model to convert your natural language queries into SQL queries, fetches data based on these SQL queries from your database, and then responds in natural language.

To start querying your SQL database, you simply input a question in human language.

For instance, you might ask, “How many entries are there in total?” The JupySQL Agent translates this into an SQL query like

SELECT COUNT(*) FROM absenteeism; 

and returns the answer in a natural language format.

from haystack.nodes.base import BaseComponent

class JupySQLQuery(BaseComponent):
    outgoing_edges = 1
    
    def __init__(self):
      %reload_ext sql

    def run(self, query: str):
        result = %sql {{query}}
        output = {
            "results":  f"{result}",
            "query": query,
            
        }
        return output

    def run_batch(self, queries: list):
        results = []
        for query in queries:
            result = %sql {query}
            output = {
                "results":  f"{result}",
                "query": query,
            }
            results.append(output)
        return results

    
jupy_sql_query = JupySQLQuery()

How does the agent work?

The JupySQLQuery is a subclass of the BaseComponent class in Haystack.

BaseComponent is an abstract base class provided by Haystack for creating custom pipeline components.

This class provides an interface for all pipeline components and includes methods and properties necessary to function within a Haystack pipeline.

In this way, JupySQLQuery is a custom class that extends the BaseComponent class and is used to execute SQL queries in a Haystack pipeline within Jupyter notebooks.

When the run() method is invoked, it executes the given query, retrieves the result, and returns a dictionary with the results and the query.

When the run_batch() method is invoked, it will take in a list of SQL queries and return a list of results. Each result is a dictionary containing the original query and its corresponding result.

Interacting with the Agent

You will need to provide an API Key for GPT-4, the language model powering the JupySQL Agent.

The agent uses a specific prompt template to generate SQL queries. It comes pre-loaded with the columns of the absenteeism database, so you don’t have to worry about memorizing column names.

Create a .env file with the following structure:

OPENAI-API=<your-openai-api-key>
OPENAI-MODEL=gpt-4

You can change the model to gpt-3 or other based on your preference.

Save the .env file.

Here’s an example of how to use the agent:

from haystack.agents import Tool
from jupysqlagent import agent

jupy_sql_query_tool = Tool(name="JupySQL_Query", 
                           pipeline_or_node=jupy_sql_query, 
                           description="""This tool is useful for consuming SQL queries \
                                        and responds with the result""")
agent.add_tool(jupy_sql_query_tool)

Asking the agent to answer a single question.

result = agent.run("Find the top 3 ages with the highest total absenteeism hours, excluding disciplinary failures. ")

Asking the agent to answer multiple questions.

agent.run("Find an answer to the following list of questions: \n\n\
          1. What is the average age of people who are social drinkers? \n\
          2. What is the average age of people who are social smokers? \n\
          3. What is the average age of people who are social drinkers and social smokers? ")

The agent also comes equipped with various tools, including the SQLQuery tool, which it uses to run the generated SQL queries and fetch data from your database.

By utilizing large language models and SQL querying capability, the JupySQL Agent makes the task of retrieving information from databases a breeze.

Join our Slack community if you need any further assistance with the JupySQL Agent.