Saving Pandas DataFrames Efficiently and Quickly — Parquet vs Feather vs ORC vs CSV
Speed, RAM, size, and convenience. Which storage method is best?
With the ever-increasing volume of data being produced, there is inevitably a need to store, and reload, that data efficiently and quickly.
CSV has been the go-to staple for a long time. However, there are much better alternatives specifically designed to deal directly with the storage, and efficient re-loading, of tabular data.
So, how much are you losing out if you are still using CSV format for storage of your data tables? And which alternative should you consider?
Introduction
When it comes to storing tabular data the ideal would be:
- Fast to write
- Fast to read
- Low RAM usage
- Low storage requirements
- Good options for compression
An option to read only part of the data, without loading the whole dataset, would also be an excellent addition to the above.
The list outlined above will therefore form the base of testing some of the more widely used methods against these factors. Specifically:
- CSV
- Feather
- Parquet
- ORC
A Brief Outline of the Storage Methods
This section will give a rough overview of each of the storage methods that will be utilised throughout this article. A simple primer, nothing more.
CSV
CSV (Comma Separated Values) is probably (still!) one of the most widely used methods of storing tabular data.
As the name implies, each row is a list of values separated by commas: each comma separator indicates a new column, and each new line indicates a new row. Very simple.
There is no easy way to read partial data, especially with regard to columns, so generally the whole file must be read to access even a small portion of the stored data.
Furthermore, although the name ‘comma’ separated values would imply a consistent standard, in reality this is not the case. Quite frequently semi-colons, tabs, and even plain spaces are used as the separator.
Not to mention potential inconsistencies in character encoding, and handling of a header row. All of which make the implementation of an efficient and repeatable encoding and decoding method more difficult.
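For completeness, this is roughly what CSV reads and writes look like from Pandas. A minimal sketch with an illustrative file name; note that zstd compression for CSV requires the zstandard package to be installed:

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

# Write a zstd-compressed CSV (compression is also inferred from a .zst extension)
df.to_csv("data.csv.zst", index=False, compression="zstd")

# usecols limits what is kept, but the parser still has to scan every row of the file
subset = pd.read_csv("data.csv.zst", usecols=["a"], compression="zstd")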
Feather
…a portable file format for storing Arrow tables or data frames (from languages like Python or R) that utilizes the Arrow IPC format internally. Feather was created early in the Arrow project as a proof of concept for fast, language-agnostic data frame storage for Python (pandas) and R. There are two file format versions for Feather…
-arrow.apache.org
Essentially a binary format that saves raw arrow data:
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
-arrow.apache.org
In essence it ticks a lot of boxes:
- Can read specific columns only (i.e. a columnar storage format)
- Multi-threading
- GPU usage
- Compression: zstd or lz4
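In Pandas, that translates into something like the following sketch (file name illustrative; the compression keyword is passed through to pyarrow.feather.write_feather):

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

# Write with zstd compression
df.to_feather("data.feather", compression="zstd")

# Columnar format: read back only the columns that are needed
subset = pd.read_feather("data.feather", columns=["a"])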
Parquet
Apache Parquet is:
…an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.
-parquet.apache.org
It would be fair to say that Parquet is more ‘feature rich’ than Feather (I won’t go into detail here), and is generally used more extensively in Big Data. In a basic usage sense they are, however, pretty comparable.
- Can read specific columns only (i.e. a columnar storage format)
- Multi-threading
- GPU usage
- Compression: snappy, gzip, brotli, lz4 and zstd
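Basic usage from Pandas is almost identical to feather (again, a minimal sketch with an illustrative file name):

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

# Write with zstd compression via the default pyarrow engine
df.to_parquet("data.parquet", compression="zstd")

# Columnar format: read back only the columns that are needed
subset = pd.read_parquet("data.parquet", columns=["a"])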
ORC (Optimised Row Columnar)
This is an interesting one.
Released around the same time as Parquet back in 2013, ORC is another columnar based file format. It has been adopted by large institutions such as Facebook, and even has claims such as:
Facebook uses ORC to save tens of petabytes in their data warehouse and demonstrated that ORC is significantly faster than RC File or Parquet.
-orc.apache.org
Now, it should be noted that ORC is generally used with Hadoop and Hive, so in the more basic usage we will be covering here it will be interesting to see what happens in comparison to Parquet and feather.
- Can read specific columns only (i.e. a columnar storage format)
- Multi-threading
- GPU usage
- Compression: snappy, zlib, lz4 and zstd
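Pandas has supported ORC writes since version 1.5 via to_orc(). A minimal sketch (illustrative file name; compression is passed to pyarrow.orc.write_table through engine_kwargs):

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

# Write with ZSTD compression through the pyarrow ORC writer
df.to_orc("data.orc", engine_kwargs={"compression": "ZSTD"})

# Columnar format: read back only the columns that are needed
subset = pd.read_orc("data.orc", columns=["a"])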
The Plan
The plan is basically to make real world comparisons of these different storage methods using the Pandas library.
Main Comparison
The comparisons will be made on five factors:
- Final saved file size
- Speed of write
- Speed of read
- RAM used while reading
- RAM used while writing
This will be completed for all of the formats, and all compression methods available to that particular file format:
CSV — No compression, bz2, gzip, tar, xz, zip, zstd
Feather — No compression, lz4, zstd
Parquet — No compression, brotli, gzip, lz4, snappy, zstd
ORC — No compression, lz4, snappy, zlib, zstd
Note: all compression algorithms will be used at their default settings utilised by Pandas. There is the option to fine tune this in some cases, but it will not be covered in this article.
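As a rough idea of how measurements of this kind can be collected, here is a small helper, a sketch only, assuming psutil is installed (this is not necessarily how the results below were gathered). Resident-memory deltas are only an approximation of true peak usage, which is one reason small or negative RAM values can crop up:

import time

import psutil


def profile(func, *args, **kwargs):
    """Time a read or write call and report the change in resident memory (MB)."""
    process = psutil.Process()
    ram_before = process.memory_info().rss
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    ram_used_mb = (process.memory_info().rss - ram_before) / 1024 ** 2
    return result, elapsed, ram_used_mb


# Hypothetical usage, here for a zstd-compressed feather file:
# _, write_time, write_ram = profile(df.to_feather, "data.feather", compression="zstd")
# df_loaded, read_time, read_ram = profile(pd.read_feather, "data.feather")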
Additional investigation
Since the release of Pandas 2 it has been possible to use PyArrow data types in DataFrames, rather than the NumPy data types that were standard in version 1 of Pandas.
We will also find out whether the new PyArrow datatypes available in Pandas 2.0 make a difference when it comes to saving and reading data.
The Data
Dummy data has been generated for the various tests. It is deliberately on the larger side of things, resulting in DataFrames that are around 2GB in size when saved as a plain CSV.
In addition, a mixture of data types will be investigated:
DataFrame 1 — Mixed (floats, strings and booleans)
100000 rows and 1200 columns
Uncompressed CSV — 2.27GB
400 columns of floats, strings and booleans, respectively
DataFrame 2 — Floats only
100000 rows and 1200 columns
Uncompressed CSV — 2.18GB
Taken from a random uniform distribution between 0 and 100:
floats_array = np.random.default_rng().uniform(low=0,high=100,size=[rows,cols])
DataFrame 3 — Strings only
100000 rows and 600 columns
Uncompressed CSV — 1.98GB
Strings are generated as an MD5 hash of the floats used for the floats array, with any “7” present replaced with a space.
Note: this DataFrame has half the columns of the previous DataFrames as strings take up more space. However, final file size is roughly the same.
string_array[:,i] = df_floats.iloc[:,i].apply(lambda x: hashlib.md5(str(x).encode('utf-8')).hexdigest().replace("7"," ")).to_numpy()
DataFrame 4 — Booleans only
100000 rows and 6000 columns
Uncompressed CSV — 3.30GB
bool_array = np.random.choice(a=[False, True], size=(rows, cols))
Note: this DataFrame has significantly more columns than the previous DataFrames as booleans don’t take up a lot of space. Final file size is actually about 50% larger than previous DataFrames.
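Putting the snippets above together, generating the individual DataFrames looks roughly like this. A sketch only, mirroring the dimensions listed above; variable names are illustrative, and the mixed DataFrame is simply 400 columns of each type combined:

import hashlib

import numpy as np
import pandas as pd

rows = 100_000

# Floats: random uniform values between 0 and 100
floats_array = np.random.default_rng().uniform(low=0, high=100, size=[rows, 1200])
df_floats = pd.DataFrame(floats_array)

# Strings: MD5 hash of each float, with any "7" replaced by a space
string_array = np.empty((rows, 600), dtype=object)
for i in range(600):
    string_array[:, i] = df_floats.iloc[:, i].apply(
        lambda x: hashlib.md5(str(x).encode("utf-8")).hexdigest().replace("7", " ")
    ).to_numpy()
df_strings = pd.DataFrame(string_array)

# Booleans: random True/False values
bool_array = np.random.choice(a=[False, True], size=(rows, 6000))
df_bools = pd.DataFrame(bool_array)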
Testing Rig
Just for transparency, this is the setup on the computer that ran all the tests in this article:
CPU — Intel i7–4790K
RAM — 20GB (all tests stayed within RAM, no swap was utilised)
GPU — EVGA GeForce GTX 1070 FTW
SSD — WD SN850 1TB (NVMe)
OS — Arch Linux
This should give you a rough benchmark when looking at the results.
Testing — Mixed Data
Initially, the testing will be broad and show a comparison between all methods with a mixed dataset (floats + strings + booleans).
We will then drill down into more specific testing for the formats coming out on top in later sections.
Note: in general all RAM statistics and write/read times will have slight variation due to it being tested on a live system.
Specifically with regard to RAM, the results are generally reliable as I was able to monitor actual usage through system apps as a comparison. Any small or negative RAM usage values can essentially be treated as zero.
All Methods — File Size — Mixed Data
In terms of generated file sizes it would appear that all formats are relatively comparable if a compression method is used. However, it is clearly a significant advantage to use anything but CSV if you intend not to use any compression methods at all.
Let’s see whether read/write speeds and RAM usage change the picture.
All Methods — Execution Time — Mixed Data
Execution time is where CSV really starts to show its inefficiencies. So much so that it is generally (excluding brotli and gzip from the parquet tests) at least four times slower to write, and at least twice as slow to read, compared to all other methods, regardless of compression.
As briefly mentioned above, the brotli and gzip algorithms are also woefully slow when it comes to write speed.
All Methods — RAM Usage — Mixed Data
RAM usage is interesting. CSV performs very well. That is something to bear in mind if you have very stringent RAM requirements, and execution time is not an issue.
However, there are some notable results here. Uncompressed feather is a top performer in both read and write.
With compression, feather and ORC perform at a similar level on both read and write. Parquet generally excels at write, but in some circumstances over-utilises RAM in read operations.
Mixed Data — Takeaway
Taking all factors together with regard to mixed data, there are actually some very interesting results.
CSV should be ignored
CSV makes little sense to use, mainly due to its extremely slow execution speed when compared to the other methods.
The only exception is if you have very stringent RAM requirements.
ZSTD is the superior compression method
It is quite obvious from the results that ZSTD is the standout compression method regardless of the file type that is used.
No matter which metric you look at, it performs well. It has some of the highest compression ratios, whilst also executing quickly, and using a relatively small amount of RAM.
Moving on…
The next few sections will drill down into the specifics a little more. CSV will also not feature, as I consider it out of the race at this point. It also makes the graphs clearer!
Specifically, the following will be considered:
- Does the datatype matter? A DataFrame full of floats, strings and booleans, respectively, will be tested to see how this compares to mixed data.
- Pandas 2.x includes the possibility to use “PyArrow” datatypes. How does this compare to the “Numpy” datatypes that have been used up to this point?
Does the type of data matter?
Starting with point 1 from the previous section, let’s find out if any of the file formats have any advantages, or weaknesses, when specific data is used in the DataFrame.
Filesize
In terms of file size, across the board ZSTD has the superior compression ratio.
ORC comes out on top, but in reality feather and parquet are roughly comparable, with the notable exception of parquet when it comes to float data.
Execution Speed — Write
Write speed is quite interesting.
The standout result is that feather is at least twice as fast as the other methods, and sometimes significantly more. Boolean data seems to be a particular problem for ORC and parquet when compared to feather.
Float data again appears to be a weakness for parquet, and the same can be said for ORC with regard to boolean data.
A really mixed bag, but a convincing win for feather.
Execution Speed — Read
All in all, read time is pretty fast and consistent between the different file types across all datatypes. Feather has a slight advantage across the board, but nothing too extreme.
RAM usage — Write
In terms of RAM usage on write, string data is consistent for all filetypes. However, when it comes to float or boolean data there is a significant advantage to be had by using parquet or uncompressed feather.
ORC is definitely a little bit behind the pack for RAM write usage.
RAM usage — Read
The RAM usage for read is a fairly mixed bag, and all file types perform roughly equivalently.
Some exceptions to this are feather without compression, which has no RAM overhead at all, and float data when utilising ORC.
Pandas 2.x with PyArrow Datatypes
One of the significant upgrades that came with the release of Pandas 2 is the inclusion of the ability to utilise PyArrow datatypes directly in DataFrames.
For example, if I were to load a feather file using numpy datatypes:
df_numpy = pd.read_feather('dataframe.feather.zstd', dtype_backend='numpy_nullable')
…or I could use the new PyArrow backend, and load PyArrow datatypes:
df_pyarrow = pd.read_feather('dataframe.feather.zstd', dtype_backend='pyarrow')
As some of the formats, such as parquet and feather, are based on PyArrow to some extent, this gives the potential for much improved performance.
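A DataFrame that is already in memory with numpy datatypes can also be converted rather than re-read; a small sketch:

# Convert numpy-backed columns to PyArrow-backed dtypes (Pandas >= 2.0)
df_pyarrow = df_numpy.convert_dtypes(dtype_backend='pyarrow')
print(df_pyarrow.dtypes)  # e.g. double[pyarrow], string[pyarrow], bool[pyarrow]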
Let’s see what sort of difference it really makes.
PyArrow Comparison
As ZSTD compression has consistently outperformed all other compression methods throughout this article, the next set of comparisons will exclusively use ZSTD as compression.
As would be expected, there is no difference between the saved file sizes, so let's move on to the execution time.
Execution time
As is quite clear, there is a significant advantage in both read and write time when using PyArrow datatypes. This is especially apparent with the ORC and feather file formats, which are at least twice as fast as when using numpy datatypes.
RAM usage
RAM usage is a mixed bag, with a significant advantage when it comes to write, but slightly worse read performance.
Quick analysis
In general CSV can be discarded. The only real exception is if RAM is very restricted in your particular use case, in which case, the severe disadvantage in execution speed may be worth it.
In general CSV can be discarded
However, generally it comes down to a choice between feather, ORC and Parquet.
Feather
Feather is generally an excellent choice. It outperforms the other formats quite consistently on read/write speed, and especially with regard to RAM usage on read if no compression is used.
[Feather] outperforms the other formats quite consistently on read/write speed
Feather is generally considered to be less feature rich than the other formats, but is obviously worth considering if speed of execution and RAM are primary concerns.
ORC
ORC is also a top performer. It doesn’t quite outdo feather in some categories, but is close enough to be relevant. More significantly, it consistently produces the highest compression ratio with ZSTD under all conditions.
[ORC] consistently produces the highest compression ratio
If ultimate compression ratio is your goal, without sacrificing too much in other areas, then ORC is a good bet.
Parquet
Parquet is generally considered a more feature rich version of feather, and is widely used in industry. It performs on a very similar level to ORC, sometimes slightly better, sometimes slightly worse. Although I would say ORC has a slight edge overall.
[Parquet] performs on a very similar level to ORC
Again, definitely worthy of consideration if your specific use case requires Parquet's particular features.
Conclusion
To keep things simple this is what I recommend based on the results of this article:
- Only use CSV if the data you have will cause problems with RAM on read and write
- Always use PyArrow datatypes rather than numpy datatypes if you can
- Always use ZSTD compression as it will result in high compression whilst also being fast
- Feather is generally the best performer, so use that if you can (a minimal recipe combining these points is sketched after this list)
- If you need the additional features of Parquet or ORC, then use one of those; you won't lose anything significant compared to feather
- If you can’t use feather, but can freely choose between ORC and Parquet, go with ORC as it is generally slightly more performant
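As referenced above, a minimal "default" recipe combining these points might look like this (file name purely illustrative):

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

# Write with feather + zstd, read back with the PyArrow dtype backend (Pandas >= 2.0)
df.to_feather("dataframe.feather.zstd", compression="zstd")
df = pd.read_feather("dataframe.feather.zstd", dtype_backend="pyarrow")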
Remember, the results in this article are specific to the data specified earlier in the article. You may find for much smaller DataFrames, or even much larger DataFrames, the results vary from what is presented here.
However, I hope you now have a decent amount of data to make an informed decision as to which format suits your specific situation best.