
Leveraging parquet’s metadata to self-document data files


tl;dr: This post shows how you can include data documentation in a parquet file by decorating the function that generated it.

Documenting code makes it more readable to us and others. The same applies to data: it is good practice to include relevant information that makes it more accessible. A straightforward approach is to ship a separate file along with the data file. For example, if we go to UCI’s ML Repository to download the iris data set, we’ll see two download options: Data Folder and Data Set Description.

I’ve started to experiment with a simple approach to document results generated by a data pipeline. Imagine you’re working on a Machine Learning project and have a few functions that apply data transformations. At each step along the way, the data changes, and providing a few essential details helps make your overall project more maintainable.

We can do this by adding a docstring to our data transformation:

def my_transformation(df):
    """Creates a new column x_squared
    """
    df['x_squared'] = df['x'] ** 2
    return df

This docstring is an improvement, but data transformations are rarely that simple and usually involve adding multiple new columns (or an entirely new set of them). We can improve our previous approach by providing column-level details:

def another_transformation(df):
    """Adds columns with powers of x (from 2 to 4)
    
    Returns
    -------
    x_squared : float
        x to the power 2
    x_cubed : float
        x to the power 3
    x_fourth : float
        x to the power 4
    """
    df['x_squared'] = df['x'] ** 2
    df['x_cubed'] = df['x'] ** 3
    df['x_fourth'] = df['x'] ** 4
    return df

I’m using the numpydoc format to document the output of this function. Strictly speaking, the Returns section is meant to document whatever the function returns (rather than column-level details), but this works great if everyone in the project agrees on the convention.
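A Returns section written this way can be turned into a structured dictionary with a small parser. Here's a minimal sketch of how that parsing could look (a real numpydoc parser handles many more cases; `parse_numpydoc` is a name I'm using for illustration, not the helper the post's library actually uses):

```python
import re


def parse_numpydoc(doc):
    """Parse a numpydoc-style docstring into a summary and a list of
    column-level entries from the Returns section (simplified sketch)."""
    lines = [line.strip() for line in doc.strip().splitlines()]
    # everything before the 'Returns' header is the summary
    cut = lines.index('Returns')
    summary = [line for line in lines[:cut] if line]

    returns, name, typ = [], None, None
    # skip the 'Returns' header and its '-------' underline
    for line in lines[cut + 2:]:
        match = re.match(r'^(\w+) : (\w+)$', line)
        if match:
            name, typ = match.groups()
        elif line and name:
            returns.append({'name': name, 'type': typ, 'desc': line})

    return {'summary': summary, 'returns': returns}
```

Feeding it a docstring like the one above yields a dictionary with `summary` and `returns` keys, which is the shape of the metadata we'll embed in the parquet file.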

The main limitation is that the docstring information is embedded in the code: once we save the data file to disk, there is no way to retrieve the documentation. Fortunately, there is an easy way to include a copy of the documentation in the data itself.

The parquet format

Some data formats support metadata storage; parquet is one of them. Including documentation in our parquet file is as simple as adding a decorator to our function. The metadata is saved in the same file, and we can retrieve it again when loading the file from disk.

Note: the source code is available here

from pprint import pprint

import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq
import numpy as np

from lib import add_metadata, read_metadata, peek_metadata

@add_metadata
def another_transformation(df):
    """
    Adds columns with powers of x (from 2 to 4)

    Returns
    -------
    x_squared : float
        x to the power 2.

    x_cubed : float
        x to the power 3.

    x_fourth : float
        x to the power 4.
    """
    df['x_squared'] = df['x'] ** 2
    df['x_cubed'] = df['x'] ** 3
    df['x_fourth'] = df['x'] ** 4
    table = pa.Table.from_pandas(df)
    # the decorator embeds the docstring
    # information in the table object
    return table
# let's generate some data
raw = pd.DataFrame({'x': np.random.rand(10)})
# apply data transformation (and automatically embed metadata)
data = another_transformation(raw)
# let's retrieve the metadata
pprint(read_metadata(data))

Console output:

{'returns': [{'desc': 'x to the power 2.',
              'name': 'x_squared',
              'type': 'float'},
             {'desc': 'x to the power 3.', 'name': 'x_cubed', 'type': 'float'},
             {'desc': 'x to the power 4.',
              'name': 'x_fourth',
              'type': 'float'}],
 'summary': ['Adds columns with powers of x (from 2 to 4)']}
# let's save the data to a file
pq.write_table(data, 'my_data.parquet')
# we can read the metadata without loading the entire file
pprint(peek_metadata('my_data.parquet'))

Console output:

{'returns': [{'desc': 'x to the power 2.',
              'name': 'x_squared',
              'type': 'float'},
             {'desc': 'x to the power 3.', 'name': 'x_cubed', 'type': 'float'},
             {'desc': 'x to the power 4.',
              'name': 'x_fourth',
              'type': 'float'}],
 'summary': ['Adds columns with powers of x (from 2 to 4)']}

Limitations

There is an important detail in this implementation. If you look at the last line in another_transformation, you’ll see that it doesn’t return a pandas.DataFrame, but a pyarrow.Table.

If you want to keep manipulating the data using pandas, you can convert it back to a pandas.DataFrame, but you’ll lose the metadata. Make sure you embed the metadata after applying all transformations, right before writing the data to disk.

Using metadata to track source code version

You can leverage metadata to store other useful information. For example, whenever I generate a data file, I also include the git commit hash, which allows me to find the exact code that generated a given file.

Metadata is a big timesaver that requires little effort

I rarely take a look at the metadata in my files, but I’ve found it extremely useful in certain circumstances:

  1. When sharing a file with someone else, the documentation is part of the file itself
  2. If some analysis goes wrong, it’s easier to track errors down (by using the git hash)
  3. When generating data reports from Jupyter, I add a cell that prints the data documentation to avoid having to copy and paste the docstring

Source code for this post is available here.

