Parquet is underdesigned. Some parts of it do not scale well.
I believe Parquet files have rather monolithic metadata at the end, with a 4 GiB maximum size limit. With 600 columns (a realistic number, believe me), we are at slightly less than 7.2 million row groups. Give each row group 8K rows and we are limited to roughly 60 billion rows total. That is not much.
The flatness of the file metadata requires external data structures to handle it reasonably well. You cannot just mmap it and be done. That external data structure will most probably take as much memory as the file metadata itself, or even more. So 4 GiB+ of your RAM will be, well, used somewhat inefficiently.
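For what it's worth, the arithmetic behind those numbers can be reproduced. Note the bytes-per-entry figure is an implicit assumption of the comment, not something from the Parquet spec; real per-column-chunk metadata is considerably larger, which would only lower the ceiling further:

```python
# Reproducing the comment's arithmetic.
# Assumptions: 4 GiB Thrift footer cap, 600 columns, 8192 rows per row group,
# and an implied ~1 byte of footer metadata per column chunk (generous).
META_CAP = 2**32                   # 4 GiB ceiling on the footer
COLS = 600
ROWS_PER_GROUP = 8 * 1024

bytes_per_group = COLS * 1                  # ~1 byte per column-chunk entry
max_groups = META_CAP // bytes_per_group    # 7,158,278: "slightly less than 7.2 million"
max_rows = max_groups * ROWS_PER_GROUP      # ~58.6 billion: the "60 billion rows" ceiling
print(max_groups, max_rows)
```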
(a block-run-mapped log-structured merge tree in one file can be as compact as a Parquet file and allows very efficient memory-mapped operations without additional data structures)
Thus, while Parquet is a step, I am not sure it is a step in definitely the right direction. Some aspects of it are good, some are not so good.
Parquet is not a database; it's a storage format that allows efficient column reads, so you can get just the data you need without having to parse and read the whole file.
Most tools can run queries across parquet files.
Like everything, it has its strengths and weaknesses, but in most cases, it has better trade-offs over CSV if you have more than a few thousand rows.
This is not emphasized often enough. Parquet is useless for anything that requires writing back computed results, as in data used by signal-processing applications.
Parquet is not HDF5. It is a static format, not a B-tree in disguise like HDF5.
You can have compressed Parquet columns with 8192 entries being a couple of tens of bytes in size. 600 columns in a row group is then 12K bytes or so, leading us to a 100 GB file, not a petabyte. Four orders of magnitude of difference between your assessment and mine.
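Spelling out this arithmetic (the 20-bytes-per-chunk figure and the row-group ceiling are taken from the thread's own assumptions, not measured):

```python
# Assumed figures: ~20 bytes per compressed 8192-entry column chunk,
# 600 columns, and the ~7.16M row-group ceiling implied by a 4 GiB footer.
BYTES_PER_CHUNK = 20
COLS = 600
GROUPS = 2**32 // COLS                     # ~7.16 million row groups

data_per_group = COLS * BYTES_PER_CHUNK    # ~12 KB of column data per row group
file_size = GROUPS * data_per_group        # ~86 GB: a "100GB file, not a petabyte"
print(file_size)
```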
"Dictionary Encoding is effective across data types (even for floating-point values) because most real-world data have low NDV ratios. Future formats should continue to apply the technique aggressively, as in Parquet."
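A minimal sketch of the dictionary-encoding idea (plain Python, not Parquet's actual encoder): repeated values are replaced by small integer codes, which is why low-NDV data compresses so well.

```python
def dict_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    """Recover the original values from the dictionary and code stream."""
    return [dictionary[c] for c in codes]

d, c = dict_encode(["a", "b", "a", "a", "c"])
print(d, c)  # ['a', 'b', 'c'] [0, 1, 0, 0, 2]
```

With a low NDV ratio, the dictionary stays tiny and the code stream packs into a few bits per entry.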
So this is not a critique, but an assessment. And Parquet has some interesting design decisions I did not know about.
A former colleague of mine is now working on a memory-mapped log-structured merge tree implementation, and it could be a good alternative. LSM provides elasticity: one can store as much data as one needs. It is static, so it can be compressed as well as Parquet-stored data, and memory mapping and implicit indexing of the data do not require additional data structures.
Something like LevelDB and/or RocksDB can provide most of that, especially when used in covering index [1] mode.
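The covering-index idea can be sketched in a few lines, with an in-memory sorted map standing in for LevelDB/RocksDB: the full record is stored alongside the sorted key, so a range scan answers the query with no secondary lookup.

```python
import bisect

class CoveringIndex:
    """Toy covering index: sorted keys, with the covered record kept
    next to the key so a scan never needs a second fetch."""

    def __init__(self):
        self._keys = []
        self._vals = []

    def put(self, key, record):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = record          # overwrite existing key
        else:
            self._keys.insert(i, key)
            self._vals.insert(i, record)

    def scan(self, lo, hi):
        """Inclusive range scan; returns (key, record) pairs directly."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_right(self._keys, hi)
        return list(zip(self._keys[i:j], self._vals[i:j]))
```

In LevelDB/RocksDB terms, this corresponds to packing the queried fields into the key (or value) so that an iterator over a key range returns everything the query needs.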
But nobody tells me that I can hit a hard limit, after which I need a second Parquet file and some code to handle that.
The situation looks to me as if my "Favorite DB server" supported, say, only 1.9 billion records per table, and if I hit that limit I needed a second instance of my "Favorite DB server" just for that unfortunate table. And it is not documented anywhere.
I would love to see the benchmarks. That is not my experience, except in the rare case of a linear read (in which CSV is much easier to parse).
CSV underperforms in almost every other domain, like joins, aggregations, filters. Parquet lets you do that lazily without reading the entire Parquet dataset into memory.
> That is not my experience, except in the rare case of a linear read (in which CSV is much easier to parse).
Yes, I think DuckDB only reads the CSV, then projects the necessary data into its internal format (which is probably more efficient than Parquet, again based on my benchmarks), and does all ops (joins, aggregations) on that format.
> The parquet file has metadata that allows duckdb to only read the parts that are actually used, reducing total amount of data read from disk/network.
This makes sense, and is what I hoped to get. But in reality it looks like parsing CSV strings works faster than the bloated and overengineered Parquet format with its libs.
> But in reality it looks like parsing CSV strings works faster than the bloated and overengineered Parquet format with its libs.
Anecdotally, having worked with large CSVs and large on-disk Parquet datasets, my experience is the opposite of yours. My DuckDB queries operate directly on Parquet on disk and never load the entire dataset, and they are always much faster than the equivalent operation on CSV files.
I think your experience might be due to -- what it sounds like -- parsing the entire CSV into memory first (CREATE TABLE) and then processing after. That is not an apples-to-apples comparison because we usually don't do this with Parquet -- there's no CREATE TABLE step. At most there's a CREATE VIEW, which is lazy.
I've seen your comments bashing Parquet in DuckDB multiple times, and I think you might be doing something wrong.
> I think your experience might be due to -- what it sounds like -- parsing the entire CSV into memory first (CREATE TABLE) and then processing after. That is not an apples-to-apples
The original discussion was about the CSV vs. Parquet "reader" part, so this is exactly apples-to-apples testing, easy to benchmark, and I stand my ground. What you are doing downstream is another question, which is not possible to discuss because no code for your logic is available.
> I've seen your comments bashing Parquet in DuckDB multiple times, and I think you might be doing something wrong.
Like running one command from the DuckDB docs.
Also, I am not "bashing", I just state that the CSV reader is faster.
Agreed. The abstractions on top of Parquet are still quite immature, though, and lots of software assumes that if you use Parquet, you also use Hive, Spark, and the like.
Take Apache Iceberg, for example. It is essentially a specification for how to store Parquet files for efficient use and exploration of data, but the only implementation... depends on Apache Spark!
Parquet is columnar storage, which is much faster for querying. And typically for protobuf you deserialize each row, which has a performance cost: you need to deserialize the whole message and can't get just the field you want.
So, if you want to query a giant collection of protobufs, you end up reading and deserializing every record. With Parquet, you get much closer to reading only what you need.
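A toy illustration of the difference, with plain Python dicts standing in for protobuf messages:

```python
# Row vs. column layout sketch (not real protobuf, just the access pattern).
rows = [{"id": i, "name": f"u{i}", "score": i * 0.5} for i in range(4)]

# Row-oriented: you must visit (deserialize) every full record
# even though only one field is wanted.
scores_row = [r["score"] for r in rows]

# Column-oriented: each field is stored contiguously, so reading one
# field means reading one array and skipping the rest entirely.
columns = {k: [r[k] for r in rows] for k in rows[0]}
scores_col = columns["score"]

print(scores_row)  # [0.0, 0.5, 1.0, 1.5]
print(scores_col)  # [0.0, 0.5, 1.0, 1.5]
```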
Parquet ~= Dremel, for those who are up on their Google stack.
Dremel was pretty revolutionary when it came out in 2006 - you could run ad-hoc analyses in seconds that previously would've taken a couple days of coding & execution time. Parquet is awesome for the same reasons.
Ah yes, that's true though a typical Anaconda installation will have them automatically installed. "sudo pip install pyarrow" or "sudo pip install fastparquet" then.
Sometimes when people discover or extensively use something they are eager to share in contexts they think are relevant. There is an issue when those contexts become too broad.
Three times across three months is hardly "astroturfing for Big Parquet" territory.
FWIW I am the same. I tend to recommend BigQuery and AWS/Athena in various posts. Many times paired with Parquet.
But that is because it makes a lot of things much simpler, and a lot of people have not realized that yet. Tooling is moving fast in this space; it is not 2004 anymore.
His arguments are still valid and 86 days is a pretty long time.
I've downloaded many CSV files that were malformed (extra commas or tabs, etc.) or had dates in non-standard formats. The Parquet format probably would not have had these issues!
.parquet preserves data types (unlike CSV)
They are 10x smaller than CSV. So 600GB instead of 6TB.
They are 50x faster to read than CSV
They are an "open standard" from the Apache Foundation
Of course, you can't peek inside them as easily as you can a CSV. But, the tradeoffs are worth it!
Please promote the use of .parquet files! Make .parquet files available for download everywhere .csv is available!