
Parquet

The Apache Parquet file format is used widely in the data space. It's a column-oriented format that focuses on storing data as efficiently as possible, with an emphasis on fast retrieval.

Why?

The most common storage format that you'd use to hold data is CSV. It's a good, simple format, but it has quite a number of shortcomings when it comes to data analysis.

Parquet is a column-oriented format, which is much more sympathetic to data analysis; an aggregate like "average age" only needs to read the single column involved. CSV is row-oriented, which is a much better fit for OLTP scenarios where whole records are read and written at a time.

Parquet also offers compression and partitioning, which are simply not available in the CSV format; we'll see both in action below.

So, like CSV it stores tabular data - but it's far better suited to data warehouse workloads.

Python

A really easy way to get started using Parquet is with Python. The PyArrow library provides Python bindings for Apache Arrow, the in-memory columnar format that backs a lot of analytics tooling. PyArrow plays nicely with Pandas and NumPy, so it's a good fit.

Make sure you have pyarrow installed as a dependency.
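
If you don't have it yet, pip will install it (along with pandas) in one go:

pip install pyarrow pandas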

import pandas as pd
import numpy as np
import pyarrow as pa

import pyarrow.parquet as pq

Create a table

First off, we’ll create a DataFrame from some raw data.

data = [
    ["John", "Smith", 23],
    ["Amber", "Brown", 31],
    ["Mark", "Green", 22],
    ["Jane", "Thomas", 26]
]

df = pd.DataFrame(
    data, 
    columns=["first_name", "last_name", "age"]
)

We can then use the from_pandas function to create a pyarrow.Table from this DataFrame.

table = pa.Table.from_pandas(df)
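
As a quick sanity check, we can inspect the schema that Arrow inferred from the DataFrame. The exact output can vary between versions (from_pandas also attaches some pandas-specific metadata), but the column types should look something like this:

# check the schema that Arrow inferred from the DataFrame
print(table.schema)
# first_name: string
# last_name: string
# age: int64
# (plus pandas metadata appended by from_pandas)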

Basic I/O

Now that we have a Table, we can save it to disk using write_table and pick it back up again using read_table.

# write the table out to a parquet file on disk
pq.write_table(table, 'people.parquet')

# read it back in, and convert it to a pandas DataFrame
people = pq.read_table('people.parquet')
people.to_pandas()
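
Earlier I mentioned compression and partitioning, and both are easy to demonstrate from here. This is just a sketch; the file and directory names are made up for this example, and partitioning on age only makes sense because our dataset is tiny:

# snappy is the default codec; others like 'gzip' and 'zstd' are available
pq.write_table(table, 'people.snappy.parquet', compression='snappy')

# partitioning splits the data into one directory per value of the
# partition column, e.g. people_partitioned/age=22/<file>.parquet
pq.write_to_dataset(
    table,
    root_path='people_partitioned',
    partition_cols=['age']
)

# read_table happily reads the whole partitioned directory back in
people = pq.read_table('people_partitioned')

On a dataset this small the partitioning buys you nothing, but on a large dataset it means a query that filters on the partition column only has to touch the matching directories.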

Finishing up

There are plenty more benefits to picking a more optimal data storage format when working in the data analytics space.