In this tutorial, we’ll show you how to build a complete Flask application that lets you upload PDFs into a database, run vector similarity search over them, and answer questions about them using OpenAI. By the end of this tutorial, you’ll have a fully deployed application.
To follow this tutorial, you’ll need a Ploomber Cloud account and an OpenAI account.
Download code
First, let’s install the `ploomber-cloud` CLI:

```bash
pip install ploomber-cloud
```
Download the sample code:
```bash
ploomber-cloud examples flask/pdf-loader
cd pdf-loader
```
And let’s initialize our project:
```bash
ploomber-cloud init
```
Since running OCR on PDF files is compute-intensive, we’ll increase the resources for our app:
```bash
ploomber-cloud resources
```
The command above will prompt you for GPUs (enter 0), CPUs (enter 2), and memory (enter 4). You’ll see a confirmation message like this:

```
Resources successfully configured: 2 CPUs, 4 GB RAM, 0 GPUs.
```
The next sections explain the architecture and code in detail; if you’d rather skip to the deployment part, go to the Deploy project section.
Architecture
Let’s examine the components of our system at a high level.
The Flask app receives requests from our users and responds. It asks users to authenticate (or create an account) and serves the pages for viewing uploaded PDFs and asking questions.
The Database (SQLite) stores parsed text from the PDFs and the embeddings we compute for each one (using OpenAI). It also performs similarity search (via `sqlite-vec`) so we can retrieve the relevant PDFs when asking a question.
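To make this concrete, here’s a minimal sketch of the kind of query `sqlite-vec` enables; the table name and embedding size are illustrative, since the sample code wraps this logic behind the Document model:

```python
import sqlite3

import sqlite_vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)  # load the sqlite-vec extension
db.enable_load_extension(False)

# One row per document; the embedding column is a fixed-size float vector
db.execute("CREATE VIRTUAL TABLE vec_documents USING vec0(embedding float[1536])")

query_embedding = [0.0] * 1536  # stand-in for a real OpenAI embedding
rows = db.execute(
    "SELECT rowid, distance FROM vec_documents "
    "WHERE embedding MATCH ? ORDER BY distance LIMIT 5",
    [sqlite_vec.serialize_float32(query_embedding)],
).fetchall()
```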
Since parsing the text from a PDF and computing embeddings takes some time, we have a job queue that allows us to execute this work in the background. The RabbitMQ server acts as a message broker, storing the background jobs in a queue, while the Celery worker processes these jobs by running OCR on each uploaded PDF, storing the text in the database, and computing embeddings that we’ll use later for similarity search.
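As a rough sketch of how this fits together (the broker URL and module name are assumptions, not taken from the sample code), the Celery app points at RabbitMQ as its broker:

```python
from celery import Celery

# Celery pushes background jobs into RabbitMQ; the worker process pulls
# them out and executes them
app = Celery("background", broker="amqp://guest:guest@localhost:5672//")
```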
Finally, Supervisor is a process control system that ensures our application components stay running. If any process crashes (like the Flask app or Celery worker), Supervisor automatically restarts it, making our application more reliable and resilient to failures.
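For illustration, a supervisord configuration along these lines keeps both processes alive; the program names and commands here are hypothetical, not copied from the sample code:

```
[program:flask]
command=gunicorn app:app --bind 0.0.0.0:80
autorestart=true

[program:worker]
command=celery -A background worker
autorestart=true
```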
Code walkthrough
The Flask application (`app.py`)
Let’s go through the main endpoints in our Flask application:
Authentication endpoints (`/login`, `/register`)
The login endpoint handles both `GET` and `POST` requests. For `POST` requests, it validates credentials and creates a session:
@app.route("/login", methods=["GET", "POST"])
def login():
if request.method == "POST":
email = request.form.get("email")
password = request.form.get("password")
# Validate credentials and create session
with Session(engine) as db_session:
user = db_session.query(User).filter_by(email=email).first()
if user and user.check_password(password):
session["email"] = user.email
return redirect(url_for("index"))
# ... error handling
The `/register` endpoint creates new user accounts:
@app.route("/register", methods=["GET", "POST"])
def register():
if request.method == "POST":
email = request.form.get("email")
password = request.form.get("password")
with Session(engine) as db_session:
user = User(email=email)
user.set_password(password)
db_session.add(user)
# ... error handling
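The `set_password` and `check_password` helpers live on the User model. A minimal sketch of how they might be implemented, assuming Werkzeug’s hashing utilities (the actual model also maps to a database table):

```python
from werkzeug.security import check_password_hash, generate_password_hash


class User:
    def set_password(self, password: str) -> None:
        # Store a salted hash, never the plain-text password
        self.password_hash = generate_password_hash(password)

    def check_password(self, password: str) -> bool:
        return check_password_hash(self.password_hash, password)
```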
Document management (`/documents`, `/upload`)
The `/documents` endpoint displays uploaded PDFs, while `/upload` handles file uploads and triggers background processing:
@app.post("/upload")
@login_required
def upload():
files = request.files.getlist("files[]")
with Session(engine) as db_session:
for file in files:
# Save PDF file
file_path = upload_dir / file.filename
file.save(file_path)
# Create document record and trigger processing
document = Document(name=file.filename, status=DocumentStatus.PENDING)
db_session.add(document)
# Run the process in the background
process_pdf_document.delay(file.filename)
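Note that `login_required` is not built into Flask. A minimal sketch of such a decorator, assuming it checks the session key set at login:

```python
from functools import wraps

from flask import redirect, session, url_for


def login_required(view):
    @wraps(view)
    def wrapped(*args, **kwargs):
        # Redirect anonymous users to the login page
        if "email" not in session:
            return redirect(url_for("login"))
        return view(*args, **kwargs)

    return wrapped
```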
Search functionality (`/search`)
The search endpoint processes queries and returns answers based on the stored documents:
@app.post("/search")
@login_required
def search_post():
query = request.form["query"]
answer = answer_query(query)
return render_template("search-results.html", answer=answer)
Question Answering
The question answering functionality is implemented in `answer.py`. Here’s how it works:
- When a user submits a query, we compute an embedding vector for their question (a sketch of the `compute_embedding` helper appears after this list):

```python
embedding = compute_embedding(query, return_single=True)
```
- We use this embedding to find the 5 most relevant documents via similarity search:

```python
similar_docs = Document.find_similar(
    db_session,
    embedding=embedding,
    limit=5,
)
```
- The relevant documents and the query are sent to GPT-4o mini through OpenAI’s chat API:

```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "\n\n".join([doc.content for doc in similar_docs])},
    {"role": "user", "content": query},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)
```
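For reference, here’s a minimal sketch of the embeddings call that a helper like `compute_embedding` wraps; the model name and the `embed` helper are assumptions, and the sample code may differ:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(text: str) -> list[float]:
    # Hypothetical helper: one input string in, one embedding vector out
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
```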
The system prompt instructs the model to only answer based on the provided documents and respond in markdown format. The response is then converted to HTML before being displayed to the user.
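As an illustration only (the actual prompt wording in `answer.py` differs, and converting markdown to HTML with the `markdown` package is an assumption):

```python
import markdown

SYSTEM_PROMPT = (
    "Answer using only the documents provided. If the documents do not "
    "contain the answer, say so. Respond in markdown format."
)

# `response` is the chat completion from the snippet above
answer_html = markdown.markdown(response.choices[0].message.content)
```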
Background Processing
The background processing functionality in `background.py` handles PDF document processing asynchronously using Celery. Here’s how it works:
- The main task, `process_pdf_document`, handles PDF processing:
```python
@app.task
def process_pdf_document(filename: str):
    with Session(engine) as db_session:
        document = db_session.query(Document).filter(Document.name == filename).first()
        document.status = DocumentStatus.PROCESSING
        db_session.commit()

        content = pdf_ocr(filename)
        embedding = compute_embedding(content[:5000])

        document.content = content
        document.embedding = embedding
        document.status = DocumentStatus.COMPLETED
        db_session.commit()
```
- The OCR functionality uses EasyOCR and PyMuPDF to extract text:
```python
def pdf_ocr(filename: str) -> str:
    content = Path(SETTINGS.PATH_TO_UPLOADS / filename).read_bytes()
    reader = easyocr.Reader(["en"])
    pdf = fitz.open(stream=content, filetype="pdf")

    # Process each page and extract its text
    extracted_text = []

    for page in pdf:
        pix = page.get_pixmap()
        results = reader.readtext(pix.tobytes())
        page_text = " ".join([text[1] for text in results])
        extracted_text.append(page_text)

    return "\n\n".join(extracted_text)
```
The process updates the document status in the database as it progresses, from `PROCESSING` to `COMPLETED`, and stores both the extracted text content and its embedding vector.
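The status names referenced throughout (`PENDING`, `PROCESSING`, `COMPLETED`) suggest a simple enum; a sketch of what it might look like (the string values are assumptions):

```python
import enum


class DocumentStatus(enum.Enum):
    PENDING = "pending"        # uploaded, waiting in the queue
    PROCESSING = "processing"  # OCR and embedding in progress
    COMPLETED = "completed"    # text and embedding stored in the database
```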
Deploy project
Move to the sample code you downloaded in the first step (`pdf-loader`) and create a `.env` file (note the leading dot) with the following content:

```
OPENAI_API_KEY=<YOUR-OPENAI-TOKEN>
```
And replace the value with your OpenAI key.
To deploy:
```bash
ploomber-cloud deploy
```
The command will print a URL to track progress. After a couple of minutes, the app will be available:
Once that happens, open the app URL and you’ll see the login form. Click on Register here to create a new account:
Enter email and password for the new account:
Once you’re logged in, click on PDFs:
Drop some PDFs so the app can parse the text (here are some sample PDFs):
For small PDFs (<10 pages), processing should take less than a minute each. Refresh your browser to keep track of progress; once your PDFs are processed, they will look like this:
Click on Search and start asking questions about your documents! Here’s a sample question we asked (this PDF contains the answer):
Limitations
This application is a proof of concept designed to demonstrate how to build a scalable PDF parser application with AI capabilities. As such, it has several limitations that make it unsuitable for production use.
We’ve made some architectural choices to keep things simple, such as using EasyOCR to parse the PDFs (which often returns low-quality results for moderately complex PDFs), computing embeddings from only the first 5,000 parsed characters, and processing PDFs one at a time in a queue. Additionally, the lack of persistent storage means all data is wiped when the app restarts. If you’re interested in deploying a production-ready system to ingest all your PDF documents, please reach out to us to discuss our enterprise-grade solutions that address these limitations.