Distribute your Python pipeline the right way (by packaging it)

Pipelines are a bunch of source code files that, when executed in the right order, produce a final result (e.g. a chart, a report or an ML model). Whatever your goal is, you want your code to be easily distributed and executed on computers other than yours, whether for reproducibility, scaling up computations or taking a model to production.

Given how easy it is to package your Python code and the advantages that come with it, there is no reason not to do it. All you need is to include a setup.py file in your project.

At its core, your setup.py file is where you tell the setuptools package how to package your code. There are lots of ways to customize it, but you just need a few directives to get going.

Note: All the example code is located here

# (minimal) setup.py

from glob import glob
from os.path import basename, splitext

from setuptools import find_packages, setup

setup(
    # your package name
    name='my_package',
    # your code location: a src/ folder in the same location as this file
    packages=find_packages('src'),
    package_dir={'': 'src'},
    py_modules=[splitext(basename(path))[0] for path in glob('src/*.py')],
    # include these extensions as part of your package (otherwise they will
    # not be included)
    package_data={"": ["*.txt", "*.rst", "*.sql", "*.ipynb"]},
    # list your dependencies, they will be installed when installing this
    # package
    install_requires=[],
    # optional dependencies
    extras_require={},
)

See the official guide for details on the available options.

Once you have a setup.py file, you can install your package using:

# setup.py is located in /path/to/your/project/
cd /path/to/your/project
pip install .

Doing this will install your package in the current Python environment, taking care of moving your source code to an appropriate location and letting Python know where to find it. This comes with a lot of advantages:

Import your code anywhere

Once your code is installed you can import modules (folders or files within your package) like this:

from my_package import my_module

my_module.my_function()

This import will work in any directory (no more PYTHONPATH editing!). If you use Jupyter notebooks, you will find this quite convenient: you can keep your logic organized in files within your package and import them in your notebooks.

Clean access to static resources (Jupyter notebooks, SQL scripts)

Your pipeline is likely to depend on files with extensions other than .py (e.g. Jupyter notebooks, SQL scripts). Loading those files using hardcoded paths (either absolute or relative) is a terrible idea, since the paths will easily break if you move your code somewhere else or change its relative structure.

Once you install your package, you can easily load these files without hardcoding them by using pkgutil (part of the standard library).

import pkgutil

# if your script is under src/my_package/sql/load_data.sql
# (get_data returns the file contents as bytes)
sql_core = pkgutil.get_data('my_package', 'sql/load_data.sql')

pkgutil is not the only way to load static files, click here for a discussion.

Note: for non-Python files to be included in your package, you have to include the package_data directive in your setup.py file.
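
To see pkgutil.get_data at work without installing anything, here is a minimal sketch. It builds a throwaway package in a temporary folder and adds that folder to sys.path as a stand-in for pip install. The package name demo_pipeline_pkg and the query text are made up for illustration.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk ("demo_pipeline_pkg" is a hypothetical name)
root = Path(tempfile.mkdtemp())
sql_dir = root / "demo_pipeline_pkg" / "sql"
sql_dir.mkdir(parents=True)
(root / "demo_pipeline_pkg" / "__init__.py").write_text("")
(sql_dir / "load_data.sql").write_text("SELECT * FROM my_table")

# Stand-in for "pip install .": make the package importable
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# The resource path is relative to the package folder, not to the
# current working directory
raw = pkgutil.get_data("demo_pipeline_pkg", "sql/load_data.sql")

# get_data returns bytes, so decode before using the content as text
query = raw.decode("utf-8")
print(query)
```

Since the lookup goes through the import system, the same call keeps working no matter which directory the script is run from.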

Command line entry points

For others (or even you) to execute your pipeline, you can provide “entry points”, which make your code available from a shell session. Once your package is installed you can execute files like this:

# execute src/my_package/my_module.py
python -m my_package.my_module

Note that you do not have to specify the file’s location, as the Python environment already knows how to find it based on the package name. It is a good practice for files executed this way to have the following structure:

# src/my_package/my_module.py


def main_func():
    pass


def another_function():
    pass


if __name__ == '__main__':
    main_func()

When running python -m my_package.my_module, the code under the if statement will be executed, but when importing it via from my_package import my_module it won’t.
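
The difference between the two modes can be checked from a single script. The sketch below writes a small module to a temporary folder (demo_guard_module and the RAN_AS_SCRIPT flag are hypothetical names), then runs it with runpy.run_module, which is the standard library function that mimics python -m, and finally imports it the regular way.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path

# Source of a hypothetical module that records whether the guard was entered
source = '''
RAN_AS_SCRIPT = False


def main_func():
    return "pipeline finished"


if __name__ == "__main__":
    RAN_AS_SCRIPT = True
    main_func()
'''

root = Path(tempfile.mkdtemp())
(root / "demo_guard_module.py").write_text(source)
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Similar in spirit to "python -m demo_guard_module": __name__ is "__main__",
# so the code under the if statement runs
script_globals = runpy.run_module("demo_guard_module", run_name="__main__")

# Regular import: __name__ is "demo_guard_module", so the guard is skipped
imported = importlib.import_module("demo_guard_module")

print(script_globals["RAN_AS_SCRIPT"])
print(imported.RAN_AS_SCRIPT)
```

The first value printed is True and the second is False, which is exactly why the guard lets the same file work as both a script and an importable module.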

Apart from using the python -m option, you can also provide custom commands like this:

# inside a shell session:
my_command

For that to work you have to specify your commands in the setup.py, which will look like this:

# setup.py
# ...
# ...
# ...
setup(
    # ...
    entry_points={
        "console_scripts": [
            # when calling "my_command" in the shell
            # the function main_func in my_package/my_module.py
            # will be executed
            "my_command = my_package.my_module:main_func",
        ],
    },
)

Click here for documentation on entry_points.
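
A function used as an entry point is called without arguments, so a common pattern is to let it parse the command line itself. Here is a minimal sketch of what main_func could look like under that assumption. The --output flag and its default value are invented for illustration and are not part of the example project.

```python
import argparse


def main_func(argv=None):
    """Run the pipeline; argv=None makes argparse read sys.argv[1:]."""
    parser = argparse.ArgumentParser(prog="my_command")
    # Hypothetical option, only here to show argument parsing
    parser.add_argument("--output", default="report.txt",
                        help="where to store the final result")
    args = parser.parse_args(argv)

    print(f"Running pipeline, writing results to {args.output}")

    # The generated my_command wrapper passes this value to sys.exit,
    # so 0 signals success to the shell
    return 0


# Passing a list makes the function easy to call from tests or notebooks
exit_code = main_func(["--output", "chart.png"])
```

Accepting an optional argv keeps the function usable both from the shell (my_command --output chart.png) and from Python code.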

Editable mode

Running pip install some_package copies the package source code to the current Python environment, which means that any changes introduced after installation will not be reflected. During development, this is undesirable but can be easily fixed by installing your package in “editable” mode:

cd /path/to/your/project
pip install --editable .

Installing it this way will not copy your code, but just tell your Python environment to use the code in /path/to/your/project, which means any code changes will be propagated.

There is another consideration, though. Once a Python module is loaded, it will not be reloaded within the same session, which means you’ll have to restart it to see changes. If you are using IPython, you can do live reloading using the autoreload extension (click here for documentation):

%load_ext autoreload
%autoreload 2
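
Outside IPython, the standard library offers importlib.reload, which re-executes a module that has already been imported. The sketch below simulates editing a file during a session. The module name demo_reload_module is hypothetical, and bytecode caching is turned off so the demo always recompiles the edited source.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid stale cached bytecode when the file changes within the same second
sys.dont_write_bytecode = True

root = Path(tempfile.mkdtemp())
module_path = root / "demo_reload_module.py"
module_path.write_text("VALUE = 'old'\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

demo_reload_module = importlib.import_module("demo_reload_module")
before = demo_reload_module.VALUE

# Simulate a code change made while the session is still open
module_path.write_text("VALUE = 'edited'\n")

# Without this call, the session would keep using the old definition
importlib.reload(demo_reload_module)
after = demo_reload_module.VALUE

print(before, after)
```

Keep in mind that names imported with from module import name before the reload still point to the old objects, which is one reason autoreload is more convenient for interactive work.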

Dependency management

Your pipeline will most likely depend on other packages to work (e.g. numpy, scikit-learn). While you can list them in a requirements.txt file, the correct way to declare package dependencies is through the install_requires directive in setup.py; these dependencies will be resolved during installation. Click here for formatting details.
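
As a rough illustration, the dependency directives could be filled in as shown below. The package names come from the paragraph above, while the version bound and the "dev" extra are placeholders rather than recommendations. The values are collected in a dictionary so the fragment can be read on its own; in a real setup.py they would be passed to setup().

```python
# setup.py (fragment)

# Installed automatically alongside your package
INSTALL_REQUIRES = [
    "numpy",
    # a version specifier restricts which releases are acceptable
    # (placeholder bound, adjust to what your code needs)
    "scikit-learn>=1.0",
]

# Optional groups, installed only on request: pip install ".[dev]"
EXTRAS_REQUIRE = {
    "dev": ["pytest"],
}

SETUP_KWARGS = dict(
    name="my_package",
    install_requires=INSTALL_REQUIRES,
    extras_require=EXTRAS_REQUIRE,
)

# In the actual setup.py: setup(**SETUP_KWARGS, ...other directives...)
```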

Bootstrapping your projects

Packaging your pipeline will make life easier for you and others, and there is no reason not to do it given how easy it is. To bootstrap this process, we are providing this template.

All you have to do to get the base folder structure is:

curl -O -L https://github.com/ploomber/template/archive/master.zip
unzip master.zip
# this will prompt you for your package name
bash template-master/install.sh
rm -f master.zip

After running the script, the following structure will be created:

└── my_package
    ├── README.md
    ├── setup.py
    └── src
        └── my_package
            └── __init__.py