Parquet is underdesigned. Some parts of it do not scale well.
I believe Parquet files have rather monolithic metadata at the end, with a 4 GiB maximum size limit. With 600 columns (a realistic number, believe me), we are at slightly less than 7.2 million row groups. Give each row group 8K rows and we are limited to roughly 60 billion rows total. That is not much.
The flatness of the file metadata requires external data structures to handle it reasonably well. You cannot just mmap it and be done. That external data structure will most probably take as much memory as the file metadata itself, or even more. So 4 GiB+ of your RAM will be, well, used somewhat inefficiently.
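For what it's worth, the arithmetic behind those numbers can be reproduced. Note the bytes-per-entry figure is an implicit assumption of the comment, not something from the Parquet spec; real per-column-chunk metadata is considerably larger, which would only lower the ceiling further:

```python
# Reproducing the comment's arithmetic.
# Assumptions: 4 GiB Thrift footer cap, 600 columns, 8192 rows per row group,
# and an implied ~1 byte of footer metadata per column chunk (generous).
META_CAP = 2**32                   # 4 GiB ceiling on the footer
COLS = 600
ROWS_PER_GROUP = 8 * 1024

bytes_per_group = COLS * 1                  # ~1 byte per column-chunk entry
max_groups = META_CAP // bytes_per_group    # 7,158,278: "slightly less than 7.2 million"
max_rows = max_groups * ROWS_PER_GROUP      # ~58.6 billion: the "60 billion rows" ceiling
print(max_groups, max_rows)
```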
(a block-run-mapped log-structured merge tree in one file can be as compact as a Parquet file and allows very efficient memory-mapped operations without additional data structures)
Thus, while Parquet is a step, I am not sure it is a step in definitely the right direction. Some aspects of it are good, some are not so good.
Parquet is not a database; it's a storage format that allows efficient column reads, so you can get just the data you need without having to parse and read the whole file.
Most tools can run queries across parquet files.
Like everything, it has its strengths and weaknesses, but in most cases, it has better trade-offs over CSV if you have more than a few thousand rows.
This is not emphasized often enough. Parquet is useless for anything that requires writing back computed results, as in data used by signal-processing applications.
Parquet is not HDF5. It is a static format, not a B-tree in disguise like HDF5.
You can have compressed Parquet columns with 8192 entries being a couple of tens of bytes in size. 600 columns in a row group is then 12K bytes or so, leading us to a 100 GB file, not a petabyte. Four orders of magnitude of difference between your assessment and mine.
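Spelling out this arithmetic (the 20-bytes-per-chunk figure and the row-group ceiling are taken from the thread's own assumptions, not measured):

```python
# Assumed figures: ~20 bytes per compressed 8192-entry column chunk,
# 600 columns, and the ~7.16M row-group ceiling implied by a 4 GiB footer.
BYTES_PER_CHUNK = 20
COLS = 600
GROUPS = 2**32 // COLS                     # ~7.16 million row groups

data_per_group = COLS * BYTES_PER_CHUNK    # ~12 KB of column data per row group
file_size = GROUPS * data_per_group        # ~86 GB: a "100GB file, not a petabyte"
print(file_size)
```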
"Dictionary Encoding is effective across data types (even for floating-point values) because most real-world data have low NDV ratios. Future formats should continue to apply the technique aggressively, as in Parquet."
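A minimal sketch of the dictionary-encoding idea (plain Python, not Parquet's actual encoder): repeated values are replaced by small integer codes, which is why low-NDV data compresses so well.

```python
def dict_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    """Recover the original values from the dictionary and code stream."""
    return [dictionary[c] for c in codes]

d, c = dict_encode(["a", "b", "a", "a", "c"])
print(d, c)  # ['a', 'b', 'c'] [0, 1, 0, 0, 2]
```

With a low NDV ratio, the dictionary stays tiny and the code stream packs into a few bits per entry.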
So this is not a critique, but an assessment. And Parquet has some interesting design decisions I did not know about.
A former colleague of mine is now working on a memory-mapped log-structured merge tree implementation, and it could be a good alternative. LSM provides elasticity: one can store as much data as one needs. It is static, so it can be compressed as well as Parquet-stored data, and memory mapping and implicit indexing of the data do not require additional data structures.
Something like LevelDB and/or RocksDB can provide most of that, especially when used in covering index [1] mode.
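The covering-index idea can be sketched in a few lines, with an in-memory sorted map standing in for LevelDB/RocksDB: the full record is stored alongside the sorted key, so a range scan answers the query with no secondary lookup.

```python
import bisect

class CoveringIndex:
    """Toy covering index: sorted keys, with the covered record kept
    next to the key so a scan never needs a second fetch."""

    def __init__(self):
        self._keys = []
        self._vals = []

    def put(self, key, record):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = record          # overwrite existing key
        else:
            self._keys.insert(i, key)
            self._vals.insert(i, record)

    def scan(self, lo, hi):
        """Inclusive range scan; returns (key, record) pairs directly."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_right(self._keys, hi)
        return list(zip(self._keys[i:j], self._vals[i:j]))
```

In LevelDB/RocksDB terms, this corresponds to packing the queried fields into the key (or value) so that an iterator over a key range returns everything the query needs.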
But nobody tells me that I can hit a hard limit, after which I need a second Parquet file and some code to handle that.
The situation looks to me as if my "Favorite DB server" supported, say, only 1.9 billion records per table, and if I hit that limit I needed a second instance of my "Favorite DB server" just for that unfortunate table. And it is not documented anywhere.
I would love to see the benchmarks. That is not my experience, except in the rare case of a linear read (in which CSV is much easier to parse).
CSV underperforms in almost every other domain, like joins, aggregations, filters. Parquet lets you do that lazily without reading the entire Parquet dataset into memory.
> That is not my experience, except in the rare case of a linear read (in which CSV is much easier to parse).
Yes, I think DuckDB only reads the CSV, then projects the necessary data into its internal format (which is probably more efficient than Parquet, again based on my benchmarks), and does all ops (joins, aggregations) on that format.
> The parquet file has metadata that allows duckdb to only read the parts that are actually used, reducing total amount of data read from disk/network.
This makes sense, and is what I hoped to get. But in reality it looks like parsing CSV strings works faster than the bloated and overengineered Parquet format with its libs.
> But in reality it looks like parsing CSV strings works faster than the bloated and overengineered Parquet format with its libs.
Anecdotally, having worked with large CSVs and large on-disk Parquet datasets, my experience is the opposite of yours. My DuckDB queries operate directly on Parquet on disk and never load the entire dataset, and they are always much faster than the equivalent operation on CSV files.
I think your experience might be due to -- what it sounds like -- parsing the entire CSV into memory first (CREATE TABLE) and then processing after. That is not an apples-to-apples comparison because we usually don't do this with Parquet -- there's no CREATE TABLE step. At most there's a CREATE VIEW, which is lazy.
I've seen your comments bashing Parquet in DuckDB multiple times, and I think you might be doing something wrong.
> I think your experience might be due to -- what it sounds like -- parsing the entire CSV into memory first (CREATE TABLE) and then processing after. That is not an apples-to-apples
The original discussion was about the CSV vs. Parquet "reader" part, so this is exactly apples-to-apples testing, easy to benchmark, and I stand my ground. What you are doing downstream is another question, which is not possible to discuss because no code for your logic is available.
> I've seen your comments bashing Parquet in DuckDB multiple times, and I think you might be doing something wrong.
Like running one command from the DuckDB docs.
Also, I am not "bashing", I just state that the CSV reader is faster.
Agreed. The abstractions on top of Parquet are still quite immature, though, and lots of software assumes that if you use Parquet, you also use Hive, Spark, and the like.
Take Apache Iceberg, for example. It is essentially a specification for how to store Parquet files for efficient use and exploration of data, but the only implementation... depends on Apache Spark!
Parquet is columnar storage, which is much faster for querying. And typically for protobuf you deserialize each row, which has a performance cost: you need to deserialize the whole message and can't get just the field you want.
So, if you want to query a giant collection of protobufs, you end up reading and deserializing every record. With Parquet, you get much closer to reading only what you need.
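A toy illustration of the difference, with plain Python dicts standing in for protobuf messages:

```python
# Row vs. column layout sketch (not real protobuf, just the access pattern).
rows = [{"id": i, "name": f"u{i}", "score": i * 0.5} for i in range(4)]

# Row-oriented: you must visit (deserialize) every full record
# even though only one field is wanted.
scores_row = [r["score"] for r in rows]

# Column-oriented: each field is stored contiguously, so reading one
# field means reading one array and skipping the rest entirely.
columns = {k: [r[k] for r in rows] for k in rows[0]}
scores_col = columns["score"]

print(scores_row)  # [0.0, 0.5, 1.0, 1.5]
print(scores_col)  # [0.0, 0.5, 1.0, 1.5]
```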
Parquet ~= Dremel, for those who are up on their Google stack.
Dremel was pretty revolutionary when it came out in 2006 - you could run ad-hoc analyses in seconds that previously would've taken a couple days of coding & execution time. Parquet is awesome for the same reasons.
Ah yes, that's true though a typical Anaconda installation will have them automatically installed. "sudo pip install pyarrow" or "sudo pip install fastparquet" then.
Sometimes when people discover or extensively use something they are eager to share in contexts they think are relevant. There is an issue when those contexts become too broad.
Three times across three months is hardly "astroturfing for Big Parquet" territory.
FWIW I am the same. I tend to recommend BigQuery and AWS/Athena in various posts. Many times paired with Parquet.
But that is because it makes a lot of things much simpler, and a lot of people have not realized that yet. Tooling is moving fast in this space; it is not 2004 anymore.
His arguments are still valid and 86 days is a pretty long time.
I've downloaded many CSV files that were malformed (extra commas or tabs, etc.) or had dates in non-standard formats. The Parquet format probably would not have had these issues!
.parquet preserves data types (unlike CSV)
They are 10x smaller than CSV. So 600GB instead of 6TB.
They are 50x faster to read than CSV
They are an "open standard" from the Apache Foundation
Of course, you can't peek inside them as easily as you can a CSV. But, the tradeoffs are worth it!
Please promote the use of .parquet files! Make .parquet files available for download everywhere .csv is available!