
.parquet files are completely underrated, many people still do not know about the format!

.parquet preserves data types (unlike CSV)

They are 10x smaller than CSV. So 600GB instead of 6TB.

They are 50x faster to read than CSV

They are an "open standard" from the Apache Software Foundation

Of course, you can't peek inside them as easily as you can a CSV. But, the tradeoffs are worth it!

Please promote the use of .parquet files! Make .parquet files available for download everywhere .csv is available!



Parquet is underdesigned. Some parts of it do not scale well.

I believe that Parquet files have rather monolithic metadata at the end, and that metadata has a 4 GB maximum size limit. With 600 columns (which is realistic, believe me), we are at slightly less than 7.2 million row groups. Give each row group 8K rows and we are limited to about 60 billion rows total. It is not much.
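For reference, the arithmetic behind those numbers appears to be the following (pure back-of-the-envelope; the one-byte-per-column-chunk assumption is wildly generous, since real Thrift metadata entries are much larger, which would make the practical ceiling even lower):

```python
FOOTER_LIMIT = 2**32     # 4 GiB cap on the file metadata
COLUMNS = 600
ROWS_PER_GROUP = 8 * 1024

# Assume one byte of footer metadata per column chunk (very generous):
max_row_groups = FOOTER_LIMIT // COLUMNS    # 7,158,278 -- "slightly less than 7.2 million"
max_rows = max_row_groups * ROWS_PER_GROUP  # ~59 billion rows
print(max_row_groups, max_rows)
```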

The flatness of the file metadata requires external data structures to handle it reasonably well. You cannot just mmap it and be done. That external data structure will most probably take as much memory as the file metadata itself, or even more. So 4 GB+ of your RAM will be used rather inefficiently.

(A block-run-mapped log-structured merge tree in a single file can be as compact as a Parquet file and allows very efficient memory-mapped operation without additional data structures.)

Thus, while Parquet is a step, I am not sure it is a step in definitely the right direction. Some aspects of it are good, some are not so good.


Parquet is not a database; it's a storage format that allows efficient column reads, so you can get just the data you need without having to parse and read the whole file.

Most tools can run queries across parquet files.

Like everything, it has its strengths and weaknesses, but in most cases, it has better trade-offs over CSV if you have more than a few thousand rows.


> Parquet is not a database.

This is not emphasized often enough. Parquet is useless for anything that requires writing back computed results, such as the data used by signal-processing applications.


> 7.2 million row groups

Why would you need 7.2 million row groups?

Row group size when stored in HDFS is usually equal to the HDFS block size by default, which is 128 MB.

7.2 million * 128 MB ≈ 1 PB

You have a single parquet file 1PB in size?


Parquet is not HDFS. It is a static format, not a B-tree in disguise like HDFS.

You can have compressed Parquet columns with 8192 entries being a couple of tens of bytes in size. 600 columns in a row group is then 12 KB or so, leading us to a 100 GB file, not a petabyte. Four orders of magnitude of difference between your assessment and mine.
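The gap between the two estimates is just the assumed size of a row group, which is easy to tabulate (plain arithmetic, using the figures from both comments):

```python
ROW_GROUPS = 7_200_000

# Parent comment's assumption: HDFS-style 128 MB row groups.
hdfs_estimate = ROW_GROUPS * 128 * 2**20   # ~0.9 PB

# This comment's assumption: 600 compressed columns at ~20 bytes each.
compact_estimate = ROW_GROUPS * 600 * 20   # ~86 GB

print(hdfs_estimate // compact_estimate)   # about 11,000x -- four orders of magnitude
```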


Some critiques of Parquet by Andy Pavlo:

https://www.vldb.org/pvldb/vol17/p148-zeng.pdf


Thanks, very insightful.

"Dictionary Encoding is effective across data types (even for floating-point values) because most real-world data have low NDV ratios. Future formats should continue to apply the technique aggressively, as in Parquet."

So this is not a critique but an assessment. And Parquet has some interesting design decisions I did not know about.

So, let me thank you again. ;)


What format would you recommend instead?


I do not know a good one.

A former colleague of mine is now working on a memory-mapped log-structured merge tree implementation, and it could be a good alternative. LSM provides elasticity: one can store as much data as one needs; it is static, so it can be compressed as well as Parquet-stored data; and the memory mapping and implicit indexing of the data do not require additional data structures.

Something like LevelDB and/or RocksDB can provide most of that, especially when used in covering index [1] mode.

[1] https://www.sqlite.org/queryplanner.html#_covering_indexes


Nobody is forcing you to use a single Parquet file.


Of course.

But nobody tells me that I can hit a hard limit, and that I will then need a second Parquet file and some code to handle that.

The situation looks to me as if my "Favorite DB server" supported, say, only 1.9 billion records per table, and if I hit that limit I would need a second instance of my "Favorite DB server" just for that unfortunate table. And it is not documented anywhere.


> They are 50x faster to read than CSV

I actually benchmarked this, and the DuckDB CSV reader is faster than the Parquet reader.


I would love to see the benchmarks. That is not my experience, except in the rare case of a linear read (in which CSV is much easier to parse).

CSV underperforms in almost every other domain, like joins, aggregations, filters. Parquet lets you do that lazily without reading the entire Parquet dataset into memory.


> That is not my experience, except in the rare case of a linear read (in which CSV is much easier to parse).

Yes, I think DuckDB only reads the CSV, then projects the necessary data into its internal format (which is probably more efficient than Parquet, again based on my benchmarks), and does all ops (joins, aggregations) on that format.


Yes, it does that, assuming you read in the entire CSV, which works for CSVs that fit in memory.

With Parquet you almost never read in the entire dataset and it's fast on all the projections, joins, etc. while living on disk.


> which works for CSVs that fit in memory.

What? Why would the CSV be required to fit in memory in this case? I tested CSVs which are far larger than memory, and it works just fine.


The entire CSV doesn't have to fit in memory, but the entire CSV has to pass through memory at some point during processing.

The parquet file has metadata that allows duckdb to only read the parts that are actually used, reducing total amount of data read from disk/network.


> The parquet file has metadata that allows duckdb to only read the parts that are actually used, reducing total amount of data read from disk/network.

This makes sense, and is what I hoped to have. But in reality it looks like parsing CSV strings works faster than the bloated and overengineered Parquet format and its libraries.


>But in reality it looks like parsing CSV strings works faster than the bloated and overengineered Parquet format and its libraries.

Anecdotally, having worked with large CSVs and large on-disk Parquet datasets, my experience is the opposite of yours. My DuckDB queries operate directly on Parquet on disk and never load the entire dataset, and they are always much faster than the equivalent operation on CSV files.

I think your experience might be due to -- what it sounds like -- parsing the entire CSV into memory first (CREATE TABLE) and then processing after. That is not an apples-to-apples comparison because we usually don't do this with Parquet -- there's no CREATE TABLE step. At most there's a CREATE VIEW, which is lazy.

I've seen your comments bashing Parquet in DuckDB multiple times, and I think you might be doing something wrong.


> I think your experience might be due to -- what it sounds like -- parsing the entire CSV into memory first (CREATE TABLE) and then processing after. That is not an apples-to-apples

The original discussion was about the CSV vs Parquet "reader" part, so this is exactly apples-to-apples testing, easy to benchmark, and I stand my ground. What you are doing downstream is another question, which is not possible to discuss because no code for your logic is available.

> I've seen your comments bashing Parquet in DuckDB multiple times, and I think you might be doing something wrong.

Like running one command from the DuckDB docs.

Also, I am not "bashing"; I just state that the CSV reader is faster.


For how many rows?


10B


Agreed. The abstractions on top of Parquet are still quite immature, though, and lots of software assumes that if you use Parquet, you also use Hive, Spark, and the rest of that stack.

Take Apache Iceberg, for example. It is essentially a specification for how to store Parquet files for efficient use and exploration of data, but the only implementation... depends on Apache Spark!


> They are 10x smaller than CSV. So 600GB instead of 6TB.

How? Lossless compression? Under what scenario?

Vague headlines like this just beg more questions.


Likely this assumes that Parquet has its internal compression applied, while the CSV is uncompressed.


You kind of can peek into parquet files with a tiny command line utility: https://github.com/manojkarthick/pqrs


Why is .parquet better than protobuf?


Parquet is columnar storage, which is much faster for querying. And typically with protobuf you deserialize each row, which has a performance cost: you need to deserialize the whole message and can't get just the field you want.

So, if you want to query a giant collection of protobufs, you end up reading and deserializing every record. With Parquet, you get much closer to only reading what you need.


Thank you.


Parquet ~= Dremel, for those who are up on their Google stack.

Dremel was pretty revolutionary when it came out in 2006 - you could run ad-hoc analyses in seconds that previously would've taken a couple days of coding & execution time. Parquet is awesome for the same reasons.


Please promote the use of .parquet files!

  apt-cache search parquet
  <nada>
Maybe later


Parquet is a file format, not a piece of software. 'apt install csv' doesn't make any sense either.


If you want to shine with snide remarks, you should at least understand the point being made:

    $ apt-cache search csv | wc -l
    225
    $ apt-cache search parquet | wc -l
    0


There is no support for parquet in Debian, by contrast

  apt-cache search csv | wc -l
  259


apt search would return tons of libparquet-java/c/python packages if it were popular.


It's more like "sudo pip install pandas" and then Pandas comes with Parquet support.


Pandas cannot read Parquet files itself; it uses third-party "engines" for that purpose, and those are not available in Debian.


Ah yes, that's true, though a typical Anaconda installation will have them installed automatically. "sudo pip install pyarrow" or "sudo pip install fastparquet", then.


This is the third time in 86 days that you've mentioned .parquet files. I am out of my element here, but it's a bit weird.


Sometimes when people discover or extensively use something they are eager to share in contexts they think are relevant. There is an issue when those contexts become too broad.

Three mentions across three months is hardly astroturfing-for-Big-Parquet territory.


FWIW I am the same. I tend to recommend BigQuery and AWS/Athena in various posts. Many times paired with Parquet.

But it is because they make a lot of things much simpler, and a lot of people have not realized that. Tooling is moving fast in this space; it is not 2004 anymore.

His arguments are still valid and 86 days is a pretty long time.


I've downloaded many CSV files that were malformed (extra commas or tabs, etc.) or had dates in non-standard formats. The Parquet format probably would not have had these issues!


No need to be so suspicious when it's an open standard not even linked to a startup.



