In this tutorial, we’ll explore how to build a web scraper using Selenium and Streamlit. Our project will focus on extracting data from Wikipedia, specifically the list of Mercury Prize winners. By combining Selenium’s web automation capabilities with Streamlit’s user-friendly interface, we’ll create an interactive application that allows users to fetch and display up-to-date information with just a click of a button.
Throughout this tutorial, we’ll cover the full deployment process in detail. We’ll explore how to create a Dockerfile that sets up the necessary dependencies, installs Chrome and ChromeDriver, and configures the environment for our Selenium-based scraper.
By containerizing our application, we’ll make it easy to deploy and scale, whether you’re running it locally or in a cloud environment. This approach not only simplifies deployment but also gives our application the same environment in development and production.
Streamlit code
Let’s start by examining the core of our application: the Streamlit code. This code creates a simple user interface and handles the web scraping functionality. Here’s what our app.py file looks like:
import streamlit as st

st.title("Mercury Prize Winners")

if st.button("Load from Wikipedia"):
    content = get_table()
    st.html(content)
The app is simple: we create a button labeled “Load from Wikipedia”. When the button is clicked, it calls the get_table() function (which we’ll define shortly) to fetch the data from Wikipedia. The retrieved HTML is then rendered with Streamlit’s st.html() function.
Now, let’s look at get_table(), which must be defined in app.py before the button code runs:
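# these imports go at the top of app.py, next to the streamlit import
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def get_table():
    url = "https://en.wikipedia.org/wiki/Mercury_Prize"
    xpath = "//table[contains(@class, 'wikitable')]"

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    # we need this since we'll run the container as root
    chrome_options.add_argument("--no-sandbox")

    driver = webdriver.Chrome(options=chrome_options)
    try:
        driver.get(url)
        # wait up to 10 seconds for the table to become visible
        element = WebDriverWait(driver, 10).until(
            EC.visibility_of_element_located((By.XPATH, xpath))
        )
        element_html = element.get_attribute("outerHTML")
    finally:
        # always close the browser so Chrome processes don't pile up
        driver.quit()

    # remove links
    soup = BeautifulSoup(element_html, "html.parser")
    for a in soup.find_all("a"):
        a.replace_with(a.text)

    # return the cleaned html
    return str(soup)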
The get_table() function scrapes the Mercury Prize winners table from Wikipedia. It configures Chrome for headless operation, starts a WebDriver, navigates to the page, and waits up to 10 seconds for the first table with the wikitable class to become visible before extracting its HTML.
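To make the waiting step more concrete, the sketch below mimics in plain Python the kind of polling loop an explicit wait performs: call a condition repeatedly until it returns something truthy or a timeout expires. This is a simplified illustration, not Selenium’s actual implementation. The name wait_until and its parameters are hypothetical, and the 0.5 second default interval is assumed to mirror WebDriverWait’s default polling frequency.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` until it returns a truthy value or `timeout` elapses.

    Hypothetical helper: a simplified stand-in for an explicit wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# usage: a fake "page" whose table only shows up on the third check
checks = {"count": 0}


def table_is_visible():
    checks["count"] += 1
    return "<table>" if checks["count"] >= 3 else None


html = wait_until(table_is_visible, timeout=2.0, poll_interval=0.01)
print(html)
```

In the real function, the condition is EC.visibility_of_element_located, and the value returned by until() is the located element rather than a string.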
Once the table is loaded, the function uses BeautifulSoup to parse and clean the HTML. It removes every link by replacing each <a> tag with its text content, which simplifies the markup (Wikipedia’s relative links would not resolve inside our app anyway). Finally, it returns the cleaned HTML as a string.
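To see what the link-removal step does on its own, here is a minimal sketch that applies the same loop to a small HTML fragment. The strip_links name and the sample row are made up for illustration; only the BeautifulSoup calls come from get_table().

```python
from bs4 import BeautifulSoup


def strip_links(html):
    """Replace every <a> element with its plain text (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a"):
        a.replace_with(a.text)
    return str(soup)


# made-up sample row, not taken from the real page
sample = '<tr><td><a href="/wiki/Example_Artist">Example Artist</a></td><td>2001</td></tr>'
cleaned = strip_links(sample)
print(cleaned)
```

The table cells keep their text, but the anchor tags and their href attributes are gone.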
Dockerfile
Let’s examine our Dockerfile, which sets up the environment for our Selenium-based web scraping application. We’ll break down the Dockerfile step by step to understand how it creates a container that can run our Streamlit app with the necessary dependencies.
# chrome is not available for ARM
FROM --platform=linux/amd64 python:3.12-slim
WORKDIR /srv
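We start from the slim Python 3.12 image and pin the platform to linux/amd64. This matters when building locally on Macs with Apple silicon (M-series chips): Docker would otherwise pull the ARM64 variant of the image, and Google Chrome for Linux wasn’t available for ARM at the time of writing. With the platform pinned, the image is built for AMD64 regardless of the host.
The working directory is set to /srv for our application code.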
Let’s now see how to install Google Chrome:
RUN apt-get update && apt-get install -y wget curl unzip
RUN wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
RUN apt-get install -y ./google-chrome-stable_current_amd64.deb
RUN google-chrome --version
These commands update the package lists, install wget, curl, and unzip, download the latest stable version of Google Chrome for AMD64, install it, and then verify the installation by printing the Chrome version.
Next, we set up ChromeDriver, which is essential for Selenium to interact with Chrome:
RUN curl -o chromedriver_linux64.zip https://storage.googleapis.com/chrome-for-testing-public/128.0.6613.86/linux64/chromedriver-linux64.zip
# the archive unpacks into a chromedriver-linux64/ directory that contains the binary
RUN unzip chromedriver_linux64.zip
RUN chmod +x chromedriver-linux64/chromedriver
RUN mv -f chromedriver-linux64/chromedriver /usr/local/bin/chromedriver
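These commands download ChromeDriver, unzip it, make it executable, and move it to a directory in the system PATH so Selenium can find it. Note that the ChromeDriver version is pinned in the URL (128.0.6613.86 here), while the Chrome package is whatever the latest stable release is at build time. The major versions must match, so update the URL when Chrome moves to a new major version.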
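Because Chrome and ChromeDriver are installed separately, it can help to check that their versions line up. The sketch below is one possible way to do that in Python: it parses the text printed by google-chrome --version and chromedriver --version, compares the major versions, and builds a download URL following the pattern of the one in the Dockerfile. All three function names are hypothetical, and the sample strings only assume the general format those commands print.

```python
import re


def major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 128.0.6613.86'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def chromedriver_url(version, platform="linux64"):
    """Build a download URL following the pattern used in the Dockerfile."""
    base = "https://storage.googleapis.com/chrome-for-testing-public"
    return f"{base}/{version}/{platform}/chromedriver-{platform}.zip"


def versions_match(chrome_output, driver_output):
    """True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


# sample strings in the format the two --version commands are assumed to print
chrome = "Google Chrome 128.0.6613.86"
driver = "ChromeDriver 128.0.6613.86 (0123abcd)"
print(versions_match(chrome, driver))
print(chromedriver_url("128.0.6613.86"))
```

The sketch does not check whether a ChromeDriver build actually exists for a given version, so treat the generated URL as a starting point rather than a guarantee.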
After setting up Chrome and ChromeDriver, we install our Python dependencies:
COPY requirements.txt /srv/
RUN pip install -r requirements.txt --no-cache-dir
Finally, we copy all our source code into the /srv directory. The ENTRYPOINT instruction defines the command that starts our Streamlit application when the container launches:
COPY . /srv
ENTRYPOINT ["streamlit", "run", "app.py", \
"--server.port=80", \
"--server.headless=true", \
"--server.address=0.0.0.0", \
"--browser.gatherUsageStats=false", \
"--server.enableStaticServing=true", \
"--server.fileWatcherType=none", \
"--client.toolbarMode=viewer"]
Deployment
Complete source code is available on GitHub.
Let’s now deploy to Ploomber Cloud. Note that we’ll first have to create an account and get an API key.
Now, let’s install the command-line interface and store the API key:
pip install ploomber-cloud
ploomber-cloud key YOURAPIKEY
Initialize the project (you’ll be prompted to confirm this is a Docker project):
ploomber-cloud init
Initializing new project...
Inferred project type: 'docker'
Is this correct? [y/N]: y
Your app 'lingering-sea-4519' has been configured successfully!
To configure resources for this project, run 'ploomber-cloud resources' or to deploy with default configurations, run 'ploomber-cloud deploy'
Deploy:
ploomber-cloud deploy
The command above will print a URL where you can track progress. Once deployed, you’ll be able to test your app!
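For a quick smoke test, either against a container running locally or against the deployed URL, a small polling script can confirm that the app is answering requests. The sketch below makes a few assumptions: the /_stcore/health path is taken to be Streamlit’s health-check endpoint, the localhost:8501 address is a hypothetical port mapping for a local container, and fetch_status and wait_until_healthy are made-up helper names. The demonstration at the end uses a fake fetch function so it runs without a server.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the HTTP status code for `url`, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except (urllib.error.URLError, OSError):
        return None


def wait_until_healthy(url, attempts=30, delay=1.0, fetch=fetch_status):
    """Poll `url` until it answers with HTTP 200; return True on success."""
    for _ in range(attempts):
        if fetch(url) == 200:
            return True
        time.sleep(delay)
    return False


# demonstration with a fake fetch function that succeeds on the third call
responses = iter([None, 503, 200])
ok = wait_until_healthy(
    "http://localhost:8501/_stcore/health",  # hypothetical local address
    delay=0,
    fetch=lambda url: next(responses),
)
print(ok)
```

To check a real app, call wait_until_healthy() with its URL and leave the fetch argument at its default.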