We recently released ploomber, a workflow management tool to accelerate DS/ML pipeline development. Check it out!
Recent Posts
Leveraging parquet's metadata to self-document data files
tl;dr: This post shows how to include data documentation inside a parquet file by decorating the function that generated it.
Documenting code makes it more readable to us and others. The same applies to data; it is good practice to include relevant information to make it more accessible. A straightforward approach is to ship a separate documentation file alongside the data file. For example, if we go to UCI’s ML Repository to download the iris data set, we’ll see two download options: Data Folder and Data Set Description.
read more
Open-source Workflow Management Tools: A Survey
Introduction Operationalizing Data Science projects is no trivial task. At the very least, data analysis workflows have to run on a regular basis to produce up-to-date results: generating a report with last week’s data, or re-training a Machine Learning model to counter concept drift. In some cases, the output of such workflows needs to be exposed as an API, for example, a trained Machine Learning model that generates predictions by hitting a REST endpoint.
read more
Training-serving skew
Training-serving skew is one of the most common problems when deploying Machine Learning models. This post explains what it is and how to prevent it.
A typical Machine Learning workflow
When training a Machine Learning model, we always follow the same series of steps:

1. Get data (usually from a database)
2. Clean it (e.g. fix/discard corrupted observations)
3. Generate features
4. Train model
5. Evaluate model

Once we clean the data (2), we apply transformations (3) to it to make the learning problem easier.
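The steps above can be sketched as a chain of functions. All names and the toy data here are hypothetical placeholders, not from the post; the point is that cleaning (2) and feature generation (3) are explicit, reusable steps, the same code that must also run at serving time to avoid skew.

```python
# Sketch of the five-step training workflow; all names are placeholders
def get_data():
    # step 1: pull raw records (stand-in for a database query)
    return [{"age": 31, "income": 50_000}, {"age": -1, "income": 62_000}]


def clean(rows):
    # step 2: discard corrupted observations (negative ages here)
    return [r for r in rows if r["age"] >= 0]


def featurize(rows):
    # step 3: turn raw columns into model features
    return [[r["age"], r["income"] / 1_000] for r in rows]


def train(X):
    # step 4: fit a trivial "model" (the mean of each feature)
    n = len(X)
    return [sum(col) / n for col in zip(*X)]


# steps 1-4 composed; step 5 would evaluate `model` on held-out data
model = train(featurize(clean(get_data())))
```

At prediction time, incoming raw records must pass through the same `clean` and `featurize` functions used in training; re-implementing them separately for serving is exactly how training-serving skew creeps in.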
read more