Parquet
06 Jul 2023

The Apache Parquet file format is used widely in the data space. It’s a column-oriented format that focuses on storing data as efficiently as possible, with an emphasis on fast data retrieval.
Why?
The most common storage format you’ll use to hold data is CSV. It’s a good, simple format, but it has quite a number of shortcomings when it comes to data analysis.
Parquet is a column-oriented format, which is much more sympathetic to data analysis. CSV is row-oriented, which is a better fit for OLTP workloads.
Parquet also offers compression and partitioning, which simply aren’t available with CSV. Both formats store your information, but Parquet is far better suited to the data warehouse use case.
Python
A really easy way to get started using Parquet is with Python. The PyArrow library provides Python bindings for Apache Arrow, an in-memory analytics toolkit. PyArrow plays nicely with Pandas and NumPy, so it’s a good fit.
Make sure you have `pyarrow` installed as a dependency.
Create a table
First off, we’ll create a `DataFrame` from some raw data. We can then use the `from_pandas` function to create a `pyarrow.Table` from this `DataFrame`.
Basic I/O
Now that we have loaded a `Table`, we can save this data using `write_table` and pick it back up off disk using `read_table`.
Finishing up
There are plenty more benefits to picking an optimal data storage format when working in the data analytics space, and Parquet is an easy one to start with.