In this tutorial, we’ll show you how to build a complete Flask application that lets you upload PDFs into a database, run vector similarity search over them, and answer questions about them using OpenAI. By the end of this tutorial, you’ll have a fully deployed application.
To follow this tutorial, you’ll need a Ploomber Cloud account and an OpenAI account.
Download code
First, let’s install the `ploomber-cloud` CLI:

```bash
pip install ploomber-cloud
```
Download the sample code:
```bash
ploomber-cloud examples flask/pdf-loader
cd pdf-loader
```
And let’s initialize our project:
```bash
ploomber-cloud init
```
Since running OCR on PDF files is compute-intensive, we’ll increase the resources for our app:
```bash
ploomber-cloud resources
```
The command above will prompt you for GPUs (enter 0), CPUs (enter 2), and memory (enter 4). You’ll see a confirmation message like this:

```
Resources successfully configured: 2 CPUs, 4 GB RAM, 0 GPUs.
```
The next sections explain the architecture and code in detail; if you’d rather skip to the deployment part, go to the Deploy project section.
Architecture
Let’s examine the components of our system at a high level.
The Flask app receives requests from our users and responds. It asks users to authenticate (or create an account) and serves the pages for viewing uploaded PDFs and asking questions.
The Database (SQLite) stores parsed text from the PDFs and the embeddings we compute for each one (using OpenAI). It also performs similarity search (via `sqlite-vec`) so we can retrieve the relevant PDFs when asking a question.
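To make this concrete, here’s a minimal sketch of the kind of query `sqlite-vec` enables; the table name and embedding size are illustrative, since the sample code wraps this logic behind the Document model:

```python
import sqlite3

import sqlite_vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)  # load the sqlite-vec extension
db.enable_load_extension(False)

# One row per document; the embedding column is a fixed-size float vector
db.execute("CREATE VIRTUAL TABLE vec_documents USING vec0(embedding float[1536])")

query_embedding = [0.0] * 1536  # stand-in for a real OpenAI embedding
rows = db.execute(
    "SELECT rowid, distance FROM vec_documents "
    "WHERE embedding MATCH ? ORDER BY distance LIMIT 5",
    [sqlite_vec.serialize_float32(query_embedding)],
).fetchall()
```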
Since parsing the text from a PDF and computing embeddings takes some time, we have a job queue that allows us to execute this work in the background. The RabbitMQ server acts as a message broker, storing the background jobs in a queue, while the Celery worker processes these jobs by running OCR on each uploaded PDF, storing the text in the database, and computing embeddings that we’ll use later for similarity search.
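As a rough sketch of how this fits together (the broker URL and module name are assumptions, not taken from the sample code), the Celery app points at RabbitMQ as its broker:

```python
from celery import Celery

# Celery pushes background jobs into RabbitMQ; the worker process pulls
# them out and executes them
app = Celery("background", broker="amqp://guest:guest@localhost:5672//")
```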
Finally, Supervisor is a process control system that ensures our application components stay running. If any process crashes (like the Flask app or Celery worker), Supervisor automatically restarts it, making our application more reliable and resilient to failures.
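For illustration, a supervisord configuration along these lines keeps both processes alive; the program names and commands here are hypothetical, not copied from the sample code:

```
[program:flask]
command=gunicorn app:app --bind 0.0.0.0:80
autorestart=true

[program:worker]
command=celery -A background worker
autorestart=true
```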
Code walkthrough
The Flask application (`app.py`)
Let’s go through the main endpoints in our Flask application:
Authentication endpoints (`/login`, `/register`)
The login endpoint handles both `GET` and `POST` requests. For `POST` requests, it validates credentials and creates a session:
@app.route("/login", methods=["GET", "POST"])
def login():
if request.method == "POST":
email = request.form.get("email")
password = request.form.get("password")
# Validate credentials and create session
with Session(engine) as db_session:
user = db_session.query(User).filter_by(email=email).first()
if user and user.check_password(password):
session["email"] = user.email
return redirect(url_for("index"))
# ... error handling
The `/register` endpoint creates new user accounts:
@app.route("/register", methods=["GET", "POST"])
def register():
if request.method == "POST":
email = request.form.get("email")
password = request.form.get("password")
with Session(engine) as db_session:
user = User(email=email)
user.set_password(password)
db_session.add(user)
# ... error handling
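The `set_password` and `check_password` helpers live on the User model. A minimal sketch of how they might be implemented, assuming Werkzeug’s hashing utilities (the actual model also maps to a database table):

```python
from werkzeug.security import check_password_hash, generate_password_hash


class User:
    def set_password(self, password: str) -> None:
        # Store a salted hash, never the plain-text password
        self.password_hash = generate_password_hash(password)

    def check_password(self, password: str) -> bool:
        return check_password_hash(self.password_hash, password)
```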
Document management (`/documents`, `/upload`)
The `/documents` endpoint displays uploaded PDFs, while `/upload` handles file uploads and triggers background processing:
@app.post("/upload")
@login_required
def upload():
files = request.files.getlist("files[]")
with Session(engine) as db_session:
for file in files:
# Save PDF file
file_path = upload_dir / file.filename
file.save(file_path)
# Create document record and trigger processing
document = Document(name=file.filename, status=DocumentStatus.PENDING)
db_session.add(document)
# Run the process in the background
process_pdf_document.delay(file.filename)
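Note that `login_required` is not built into Flask. A minimal sketch of such a decorator, assuming it checks the session key set at login:

```python
from functools import wraps

from flask import redirect, session, url_for


def login_required(view):
    @wraps(view)
    def wrapped(*args, **kwargs):
        # Redirect anonymous users to the login page
        if "email" not in session:
            return redirect(url_for("login"))
        return view(*args, **kwargs)

    return wrapped
```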
Search functionality (`/search`)
The search endpoint processes queries and returns answers based on the stored documents:
@app.post("/search")
@login_required
def search_post():
query = request.form["query"]
answer = answer_query(query)
return render_template("search-results.html", answer=answer)
Question Answering
The question answering functionality is implemented in `answer.py`. Here’s how it works:
- When a user submits a query, we compute an embedding vector for their question (a sketch of the `compute_embedding` helper appears after this list):

```python
embedding = compute_embedding(query, return_single=True)
```
- We use this embedding to find the 5 most relevant documents via similarity search:

```python
similar_docs = Document.find_similar(
    db_session,
    embedding=embedding,
    limit=5,
)
```
- The relevant documents and the query are sent to GPT-4o mini through OpenAI’s chat API:

```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "\n\n".join([doc.content for doc in similar_docs])},
    {"role": "user", "content": query},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)
```
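For reference, here’s a minimal sketch of the embeddings call that a helper like `compute_embedding` wraps; the model name and the `embed` helper are assumptions, and the sample code may differ:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(text: str) -> list[float]:
    # Hypothetical helper: one input string in, one embedding vector out
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
```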
The system prompt instructs the model to only answer based on the provided documents and respond in markdown format. The response is then converted to HTML before being displayed to the user.
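As an illustration only (the actual prompt wording in `answer.py` differs, and converting markdown to HTML with the `markdown` package is an assumption):

```python
import markdown

SYSTEM_PROMPT = (
    "Answer using only the documents provided. If the documents do not "
    "contain the answer, say so. Respond in markdown format."
)

# `response` is the chat completion from the snippet above
answer_html = markdown.markdown(response.choices[0].message.content)
```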
Background Processing
The background processing functionality in `background.py` handles PDF document processing asynchronously using Celery. Here’s how it works:
- The main task, `process_pdf_document`, handles PDF processing:
```python
@app.task
def process_pdf_document(filename: str):
    with Session(engine) as db_session:
        document = db_session.query(Document).filter(Document.name == filename).first()
        document.status = DocumentStatus.PROCESSING
        db_session.commit()

        content = pdf_ocr(filename)
        embedding = compute_embedding(content[:5000])

        document.content = content
        document.embedding = embedding
        document.status = DocumentStatus.COMPLETED
        db_session.commit()
```
- The OCR functionality uses EasyOCR and PyMuPDF to extract text:
```python
def pdf_ocr(filename: str) -> str:
    content = Path(SETTINGS.PATH_TO_UPLOADS / filename).read_bytes()
    reader = easyocr.Reader(["en"])
    pdf = fitz.open(stream=content, filetype="pdf")

    # Process each page and extract its text
    extracted_text = []

    for page in pdf:
        pix = page.get_pixmap()
        results = reader.readtext(pix.tobytes())
        page_text = " ".join([text[1] for text in results])
        extracted_text.append(page_text)

    return "\n\n".join(extracted_text)
```
The process updates the document status in the database as it progresses, from `PROCESSING` to `COMPLETED`, and stores both the extracted text content and its embedding vector.
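The status names referenced throughout (`PENDING`, `PROCESSING`, `COMPLETED`) suggest a simple enum; a sketch of what it might look like (the string values are assumptions):

```python
import enum


class DocumentStatus(enum.Enum):
    PENDING = "pending"        # uploaded, waiting in the queue
    PROCESSING = "processing"  # OCR and embedding in progress
    COMPLETED = "completed"    # text and embedding stored in the database
```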
Deploy project
Move to the sample code you downloaded in the first step (`pdf-loader`) and create a `.env` file (note the leading dot) with the following content:

```
OPENAI_API_KEY=<YOUR-OPENAI-TOKEN>
```
And replace the value with your OpenAI key.
To deploy:
```bash
ploomber-cloud deploy
```
The command will print a URL to track progress. After a couple of minutes, the app will be available:
Once that happens, open the app URL and you’ll see the login form. Click on Register here to create a new account:
Enter email and password for the new account:
Once you’re logged in, click on PDFs:
Drop some PDFs so the app can parse the text (here are some sample PDFs):
For small PDFs (<10 pages), processing should take less than a minute each. Refresh your browser to keep track of progress; once your PDFs are processed, they will look like this:
Click on Search and start asking questions about your documents! Here’s a sample question we asked (this PDF contains the answer):
Limitations
This application is a proof of concept designed to demonstrate how to build a scalable PDF parser application with AI capabilities. As such, it has several limitations that make it unsuitable for production use.
We’ve made some architectural choices to keep things simple, such as using EasyOCR to parse the PDFs (which often returns low-quality results for moderately complex PDFs), computing embeddings from only the first 5,000 parsed characters, and processing PDFs one at a time in a queue. Additionally, the lack of persistent storage means all data is wiped when the app restarts. If you’re interested in deploying a production-ready system to ingest all your PDF documents, please reach out to us to discuss our enterprise-grade solutions that address these limitations.