Cogs and Levers A blog full of technical stuff


The Apache Parquet file format is used widely in the data space. It’s a column-oriented format that focuses on storing data as efficiently as possible, with emphasis on data retrieval.


CSV is probably the most common format that you’d use to store and exchange tabular data. It’s a good, simple format, but it has a number of shortcomings when it comes to data analysis.

Parquet is a column-oriented format: the values of each column are stored together, so an analytical query can read just the columns it needs. CSV is row-oriented, which is a better fit for OLTP-style workloads that read and write whole records at a time.

Parquet files are compressed column by column, and Parquet tooling commonly supports splitting a dataset into partitions. Plain CSV has neither of these built in.

So, both formats store tabular data, but Parquet is better suited to data warehouse workloads.


A really easy way to get started using Parquet is with Python. PyArrow is the Python library for Apache Arrow, an in-memory columnar format for analytics, and it can read and write Parquet files. PyArrow plays nicely with Pandas and NumPy, so it’s a good fit.

Make sure you have pyarrow installed as a dependency.

import pandas as pd
import numpy as np
import pyarrow as pa

import pyarrow.parquet as pq

Create a table

First off, we’ll create a DataFrame from some raw data.

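# Each inner list is one row: first name, last name, age
data = [
    ["John", "Smith", 23],
    ["Amber", "Brown", 31],
    ["Mark", "Green", 22],
    ["Jane", "Thomas", 26]
]

# Build the DataFrame from the rows, naming each column
df = pd.DataFrame(
    data,
    columns=["first_name", "last_name", "age"]
)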
We can then use the from_pandas function to create a pyarrow.Table from this DataFrame.

table = pa.Table.from_pandas(df)

Basic I/O

Now that we have a Table, we can save it to disk using write_table and read it back using read_table.

pq.write_table(table, 'people.parquet')

people = pq.read_table('people.parquet')

Finishing up

There are many more benefits to picking a more suitable data storage format when working in the data analytics space.