Building a Web Scraper with Selenium and Streamlit

In this tutorial, we’ll explore how to build a web scraper using Selenium and Streamlit. Our project will focus on extracting data from Wikipedia, specifically the list of Mercury Prize winners. By combining Selenium’s web automation capabilities with Streamlit’s user-friendly interface, we’ll create an interactive application that allows users to fetch and display up-to-date information with just a click of a button.

Throughout this tutorial, we’ll cover the full deployment process in detail. We’ll explore how to create a Dockerfile that sets up the necessary dependencies, installs Chrome and ChromeDriver, and configures the environment for our Selenium-based scraper.

By containerizing our application, we’ll make it easy to deploy and scale, whether you’re running it locally or in a cloud environment. This approach not only simplifies the deployment process but also ensures that our application runs identically in development and production settings.

Streamlit code

Let’s start by examining the core of our application: the Streamlit code. This code creates a simple user interface and handles the web scraping functionality. Here’s a breakdown of what our app.py file looks like:

import streamlit as st

st.title("Mercury Prize Winners")


if st.button("Load from Wikipedia"):
    content = get_table()
    st.html(content)

The app is simple: we create a button labeled “Load from Wikipedia”. When the button is clicked, it calls the get_table() function (which we’ll define shortly) to fetch the data from Wikipedia, and the retrieved content is displayed with Streamlit’s st.html() function.
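
Since the scrape takes a few seconds, it’s worth showing feedback while it runs. Here’s a minimal variation using Streamlit’s built-in st.spinner (the message text is just an example):

if st.button("Load from Wikipedia"):
    # show a spinner while the scraper runs
    with st.spinner("Fetching the table from Wikipedia..."):
        content = get_table()
    st.html(content)

Note that st.html() is a relatively recent addition to Streamlit (introduced in version 1.32), so make sure your requirements install a version that includes it.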

Now, let’s look at the get_table() function, along with the imports it needs:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def get_table():
    url = "https://en.wikipedia.org/wiki/Mercury_Prize"
    xpath = "//table[contains(@class, 'wikitable')]"

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    # we need this since we'll run the container as root
    chrome_options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)

    # wait for the table to load
    element = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.XPATH, xpath))
    )

    element_html = element.get_attribute("outerHTML")

    # close the browser so we don't leak Chrome processes
    driver.quit()

    # remove links by replacing each <a> tag with its text
    soup = BeautifulSoup(element_html, "html.parser")

    for a in soup.find_all("a"):
        a.replace_with(a.text)

    # return the cleaned html
    return str(soup)

This get_table() function is responsible for scraping the Mercury Prize winners table from Wikipedia. It uses Selenium to navigate to the page and wait for the table to load, then extracts the table’s HTML content. The function sets up Chrome options for headless operation and initializes a WebDriver to interact with the page dynamically.

Once the table is loaded, the function uses BeautifulSoup to parse and clean the HTML content. It removes all links from the table by replacing each <a> tag with its text content, simplifying the table structure. Finally, it returns the cleaned HTML as a string, efficiently combining Selenium for dynamic page interaction and BeautifulSoup for HTML parsing to extract the desired information from Wikipedia.
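
Before moving on, you can verify the app works locally (assuming streamlit, selenium, and beautifulsoup4 are installed, and Chrome with a matching ChromeDriver is available on your machine):

streamlit run app.py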

Dockerfile

Let’s examine our Dockerfile, which sets up the environment for our Selenium-based web scraping application. We’ll break down the Dockerfile step by step to understand how it creates a container that can run our Streamlit app with the necessary dependencies.

# chrome is not available for ARM
FROM --platform=linux/amd64 python:3.12-slim

WORKDIR /srv

We start with a slim Python 3.12 image pinned to the AMD64 architecture. This is crucial when building locally on Macs with M-series chips, since Google Chrome doesn’t ship Linux ARM builds; pinning --platform=linux/amd64 lets Docker build and run the image under emulation on those machines.

The working directory is set to /srv for our application code.

Let’s now see how to install Google Chrome:

RUN apt-get update && apt-get install -y wget curl unzip
RUN wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
RUN apt-get install -y ./google-chrome-stable_current_amd64.deb
RUN google-chrome --version

These commands update the package lists, download the latest stable version of Google Chrome for AMD64 architecture, install it, and then verify the installation by printing the Chrome version.
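
As an optional optimization (not part of the original Dockerfile), these steps can be collapsed into a single RUN instruction that also deletes the downloaded .deb and the apt cache, keeping the final image smaller:

RUN apt-get update && apt-get install -y wget curl unzip \
    && wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb \
    && apt-get install -y ./google-chrome-stable_current_amd64.deb \
    && rm google-chrome-stable_current_amd64.deb \
    && rm -rf /var/lib/apt/lists/*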

Next, we set up ChromeDriver, which is essential for Selenium to interact with Chrome:

RUN curl -o chromedriver_linux64.zip https://storage.googleapis.com/chrome-for-testing-public/128.0.6613.86/linux64/chromedriver-linux64.zip
RUN unzip chromedriver_linux64.zip

# the zip extracts into a chromedriver-linux64/ directory
RUN chmod +x chromedriver-linux64/chromedriver
RUN mv -f chromedriver-linux64/chromedriver /usr/local/bin/chromedriver

These commands download the ChromeDriver build matching the installed Chrome version, unzip the archive, make the binary executable, and move it to a directory on the system PATH so Selenium can find it. Note that the version number in the URL is pinned, so if the Chrome install above pulls a newer stable release, this URL must be updated to match.
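
If you’d prefer not to hard-code the version, the Chrome for Testing project publishes the current stable version as plain text, which you can interpolate into the download URL. A sketch (verify the resolved version matches the Chrome your image installs):

RUN CHROME_VERSION=$(curl -s https://googlechromelabs.github.io/chrome-for-testing/LATEST_RELEASE_STABLE) \
    && curl -o chromedriver_linux64.zip "https://storage.googleapis.com/chrome-for-testing-public/${CHROME_VERSION}/linux64/chromedriver-linux64.zip"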

After setting up Chrome and ChromeDriver, we install our Python dependencies:

COPY requirements.txt /srv/
RUN pip install -r requirements.txt --no-cache-dir
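
The contents of requirements.txt aren’t shown here; at a minimum it needs the three libraries the app imports (versions left unpinned for brevity, though pinning them is good practice):

streamlit
selenium
beautifulsoup4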

Finally, we copy all our source code into the /srv directory. The ENTRYPOINT instruction defines how the container starts our Streamlit application:

COPY . /srv

ENTRYPOINT ["streamlit", "run", "app.py", \
            "--server.port=80", \
            "--server.headless=true", \
            "--server.address=0.0.0.0", \
            "--browser.gatherUsageStats=false", \
            "--server.enableStaticServing=true", \
            "--server.fileWatcherType=none", \
            "--client.toolbarMode=viewer"]

Deployment

Complete source code is available on GitHub.

Let’s now deploy to Ploomber Cloud. Note that we’ll first have to create an account and get an API key.

Now, let’s install the command-line interface and store the API key:

pip install ploomber-cloud
ploomber-cloud key YOURAPIKEY

Initialize the project (you’ll be prompted to confirm this is a Docker project):

ploomber-cloud init
Initializing new project...
Inferred project type: 'docker'
Is this correct? [y/N]: y
Your app 'lingering-sea-4519' has been configured successfully!
To configure resources for this project, run 'ploomber-cloud resources' or to deploy with default configurations, run 'ploomber-cloud deploy'

Deploy:

ploomber-cloud deploy

The command above will print a URL where you can track progress. Once deployed, you’ll be able to test your app!
