ParaText: CSV parsing at 2.5 GB per second (wise.io)
152 points by flashman on June 8, 2016 | 42 comments


This is impressive, but...

"A fast reader exploits the capabilities of the storage system"...

the graphs show that their storage system is doing 4.00 GB/sec

I wonder what processor this is running on and what their storage system is... multiple PCIe SSDs?

I tried running a quick test but only succeeded in OOMing my 8G laptop.

Even just doing

  import paratext
  it = paratext.load_csv_as_iterator("/dev/shm/tmp/c.log", expand=True, forget=True)
  x = it.next()
Starts eating up all my RAM after about a minute of spinning the CPU... so I think they have a slightly different definition of an iterator than everyone else.

Compared to

  cut -d , -f 5 < c.log > /dev/null
which runs in a few seconds, or a slightly more domain-specific and optimized version of 'cut'[1] that runs even faster (300-500 MB/s on a single core, depending on which fields you want).

  $ du -hs c.log ;wc -l c.log 
  2.1G	c.log
  16197412 c.log
I also wonder if that is 2.5 GB/s per core.

https://github.com/BurntSushi/rust-csv does 241 MB/s in raw mode, so I find it a little hard to believe that this is 10x faster... unless that is while maxing out multiple cores.

[1] https://github.com/bro/bro-aux/blob/master/bro-cut/bro-cut.c


From looking at their source briefly, I can make two observations:

Firstly, their core parsing routine is a state machine not unlike the one found in rust-csv: https://github.com/wiseio/paratext/blob/master/src/csv/rowba...

Secondly, they do indeed appear to be achieving high parsing speed using coarse-grained parallelism. You can see the code that chunks up the CSV data here: https://github.com/wiseio/paratext/blob/master/src/generic/c...

It's pretty clever!
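For the curious, here's a rough sketch of that coarse-grained approach (not ParaText's actual code, and it naively assumes no quoted newlines): find byte offsets that land on newline boundaries and hand each chunk to a worker.

  import os
  from concurrent.futures import ProcessPoolExecutor

  def find_chunks(path, n_chunks):
      # return (start, end) byte ranges, each ending on a newline boundary
      size = os.path.getsize(path)
      bounds = [0]
      with open(path, "rb") as f:
          for i in range(1, n_chunks):
              f.seek(i * size // n_chunks)
              f.readline()  # advance to the next newline
              bounds.append(f.tell())
      bounds.append(size)
      return list(zip(bounds, bounds[1:]))

  def count_commas(path, start, end):
      # toy per-chunk worker; a real parser would run its state machine here
      with open(path, "rb") as f:
          f.seek(start)
          return f.read(end - start).count(b",")

  def parallel_scan(path, n_chunks=4):
      with ProcessPoolExecutor() as pool:
          jobs = [pool.submit(count_commas, path, s, e)
                  for s, e in find_chunks(path, n_chunks)]
          return sum(j.result() for j in jobs)

  if __name__ == "__main__":
      print(parallel_scan("c.log"))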


Can this be used to speed up rust-csv?


It feels like something that should be built on top of rust-csv.


I thought this was the parser's basic ability: parsing CSV in chunks.


Does cut work with quoted fields correctly? My understanding was that it was just a dumb line tokenizer and CSV is a little more complex than that, e.g., CSV rows can span more than one line when fields contain line breaks.
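For example (a quick illustration with Python's csv module, using made-up data):

  import csv, io

  data = 'id,comment\n1,"line one\nline two"\n'
  rows = list(csv.reader(io.StringIO(data)))
  # rows == [['id', 'comment'], ['1', 'line one\nline two']]
  # a naive line splitter (or cut) would see three separate records here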


No, it doesn't, and it is therefore not a viable option for many real-world datasets.


Which means an effective thing to do when you control the data pipeline is to ban tabs and line-breaks from your values, and then use `cut` on tab-separated files.
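Something like this tiny Python sketch of the writer side, assuming you control it (write_tsv is just an illustrative name):

  def write_tsv(rows, out):
      for row in rows:
          # enforce the convention: no tabs or newlines inside values
          if any('\t' in f or '\n' in f for f in row):
              raise ValueError("banned character in value")
          out.write('\t'.join(row) + '\n')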


Sure, if you can redefine your data format, you can take shortcuts. But otherwise comparing cut to a CSV parser is just a different way of saying »if it doesn't have to be correct, I can make it as fast as I want«.


Or just not use CSV as a data transfer format :D


I think you're missing the point. Unix tools like `cut` are really effective ways to deal with plain text data; who cares if it's CSV.


> 4.00 GB/sec

> I wonder what processor this is running on and what their storage system this is.. multiple PCIe SSD?

Consumer-grade NVMe disks currently achieve 2 GB/s. It is easy to use a couple of those, or just one professional SSD.


> Starts eating up all my ram after about a minute of spinning the cpu... so I think they have a slightly different definition of an iterator as everyone else.

I don't think so; the difference is that you're expecting it to read one row at a time when it's actually reading one column at a time. load_csv_as_iterator is iterating over the columns.


> I wonder what processor this is running on and what their storage system this is.. multiple PCIe SSD?

One high-end PCIe SSD can manage that; two easily could. A high-end NAS might, too.


It really makes me sad that CSV even exists: ASCII defines field ('unit') & record separator characters (also group & file, but those are less useful), as well as an escape character. With those few characters, all of the mess of CSV encoding could be solved with these few rules:

    - all records are separated by an RS character (#x1e)
    - all fields within a file are separated by a US character (#x1f)
    - all instances of RS, US & ESC within a field are prefixed with an ESC (#x1b)
    - there are no more rules
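In Python, the whole thing would be about this much code (the helper names are just illustrative):

  RS, US, ESC = "\x1e", "\x1f", "\x1b"

  def escape_field(field):
      # prefix any RS, US or ESC occurring inside a field with ESC
      return "".join(ESC + c if c in (RS, US, ESC) else c for c in field)

  def encode(records):
      return RS.join(US.join(escape_field(f) for f in record) for record in records)

  print(repr(encode([["name", "note"], ["bob", "a,b\x1fc"]])))
  # 'name\x1fnote\x1ebob\x1fa,b\x1b\x1fc'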
It's remarkable to me that ASCII defines a pretty full-featured mechanism for information interchange (start of header, file transfer &c.) and instead we continue to build mechanisms atop its alphabetic characters.

It's like the original sin of computing is Not Invented Here (no doubt someone will pipe up with a story of how ASCII itself was the product of NIH!).


I used to religiously believe in this, but in practice it isn't useful. The whole point is to be roughly human-readable and using non-printing characters defeats that. You can't even easily enter these things via the command line.

If we're abandoning human-readability, why even bother with ASCII? Just use a binary format. Has anyone actually used ASCII unit and record separator delimiters successfully? I'd be curious about what advantages they had over a binary format, even just a protobuf or Thrift serialized form. If we want to preserve schemalessness, there's stuff like Sereal.


> Has anyone actually used ASCII unit and record separator delimiters successfully?

At my first job, the (proprietary) database server we used had a communication protocol that used the ASCII STX/ETX/EOT/ACK/ETB/GS/RS/US characters. When I looked up those ASCII codes I was like "Oh, that's clever", but in practice, it was just another proprietary binary format, and had all the problems of proprietary binary formats. The control codes still needed to be escaped in source code; they still weren't visible in log files or when viewing console output. They were basically just bytes we had to send to implement the protocol.


> The whole point is to be roughly human-readable and using non-printing characters defeats that.

You're assuming that separator characters are not human-readable. If ASCII had been used as originally intended, they'd be just as readable as line breaks. Err, carriage returns.

Computing depresses me some days …


Most people build systems to work in the world we have, not the world we wish we had.

There are all sorts of examples like this. UNIX was initially designed to work with pipes of line-oriented streams, so why did we get scripting languages where every UNIX command is reinvented as a function call, or RPC frameworks where the pipe is replaced by a binary message? PHP was initially a templating language, so why did we get Smarty, Wordpress, and PEAR templates? The web was supposed to come with full support for editing & creating pages via a WYSIWYG editor and have a built-in mechanism (hyperlinks) for associating pages with other people, so why did we need Facebook to introduce the idea of "sharing content with other people"?

In each case, there were real, pragmatic reasons that people invented new systems instead of doing what they were "supposed" to do. Line-oriented files are clumsy for representing hierarchical data or conditionals. PHP is too hard to use for most end-users, despite being built for pragmatic "just toss a webpage up" use. The editing features in the web disappeared early on, with Netscape, and they needed the critical mass of college students that Facebook provided before people felt they had an audience for anything they did.

The moral for system designers is that you can't just throw a feature out there and say "Use this." You have to adapt it to how people actually do use it, even if that usage seems brain-dead to you.


One of the problems with this mechanism is data entry. It requires specialized software to insert an RS character instead of a newline ('\n') when going to the next row, for example.

After it has been saved in that format, it is a text file consisting of one very long line (unless some of the fields have newlines, but that doesn't represent the actual structure). This makes it inconvenient to browse with a text editor or pager. Sure, you could have a special mode to make them show up as newlines, but then you can't tell the difference between real newlines and newlines inserted for convenience.

Consequently, Unix text-processing tools like cut, sed, and awk use a line feed ('\n') as a record delimiter, unless there is a special reason not to (e.g. the -print0 flag on find).

Related discussion here:

https://news.ycombinator.com/item?id=7474600


I regularly use non-printable ASCII characters as delimiters. I just confirm first that they are not in the input. I am going to start using the ones you mentioned; this solution is one I rarely if ever see used. The other solution I have seen in practice is to encode the comma character. The real problem, IMO, is that people who make CSV place no limits on what range of characters is stored in the tables. Certain characters can safely be removed: in most cases the input is still comprehensible without commas, and deleting them frees up the comma for use exclusively as a field separator.
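For what it's worth, the confirmation step is a one-liner in Python (the filename and delimiter here are just examples):

  with open("input.txt", "rb") as f:
      assert b"\x1f" not in f.read(), "chosen delimiter already appears in the data"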


The main advantage of CSV is that it is human-readable. But if you have this much data, then why not use some binary format that you can just read in as a blob, and don't even need to parse?

Also, the approach mentioned here is only useful if reading+parsing the data is the bottleneck (or close to it). For example, if reading+parsing takes only 10% of the total processing time, then optimizing this stage will only give at most an 11% increase in performance (1/0.9 ≈ 1.11).


CSV is a ubiquitous data format, especially when you're receiving data from external sources that you don't have control over, which often happens to be the case in banking and finance (streaming CSV dumps are quite common).


CSV is a stupid format. XML, JSON, and YAML are better to read and easier to parse/validate. After compression, these formats take a similar amount of space, or even less when cross-referencing.

CSV, tab-delimited files, and flat files in general are a terrible idea in banking. I know because I work for a bank; these are the source of all misery.


Adding to what bladecatcher said, even if you want to work internally with a binary format, the fact that you receive large CSV data dumps from the outside means you'll have to transform them into your internal format of choice. The first part of that transformation would greatly benefit from a fast CSV parser such as ParaText.

Another data point: data changes (price updates, number ranges) between European telecom operators are all CSV files over FTP. You can recognize the security-aware by their use of SFTP!


"For example, if reading+parsing takes only 10% of the total processing time, then optimizing this stage will only give at most a 11% increase in performance."

10% performance increase is nothing to scoff at in a mature code base.


When looking at the graphs, remember they are in log scale. The results are a lot more impressive than they look at first glance.


Woah, thanks for that warning. I was wondering when NumPy got so fast for a second!


> Despite extensive use of distributed databases and filesystems in data-driven workflows, there remains a persistent need to rapidly read text files on single machines.

This comment makes me wonder how often a distributed approach is used out of some odd sense of convenience or interest, instead of taking the time to create an optimized single-node approach.

In other words, how often is a Hadoop-like system used when it's unnecessary, adding needless complexity?


Did the authors benchmark against kdb?

e.g., run a single k interpreter on each CPU, then divide and conquer

Isn't it faster to do parsing in memory and avoid I/O wherever possible?


Best of three, on an i7 with 8 GB RAM:

Loading a 156 MB CSV file in the kdb 32-bit free version, single thread:

    \t trade:`sym`time`ex`cond`size`price!("STCCXH";",")0:`t.csv
    1850
in paratext, 64-bit:

    >>> timeit.timeit('paratext.load_csv_to_dict("t.csv",num_threads=4)', setup="import paratext", number=1)
    3.1176819801330566
No improvement.

Loading a 1.5 GB CSV file in kdb:

    \t quote:`sym`time`ex`bid`bsize`ask`asize`mode!("STCHXHXC";",")0:`q.csv
    14135
in paratext:

    >>> timeit.timeit('paratext.load_csv_to_dict("q.csv",num_threads=4)', setup="import paratext", number=1)
    12.962939977645874
Not too shabby! An almost 10% improvement over KDB by turning my fans on and burning my lap!

However, I think they should probably make their parser faster before they waste heat trying to make slow code finish sooner.


The HDF5 benchmark looks bogus; it's not clear what exactly the author meant by it or how the data was stored/retrieved. With proper filters set up, HDF5 should beat any CSV reader in both (uncompressed) throughput and runtime.
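For reference, a rough h5py sketch of what I mean by "proper filters" (the file and dataset names are made up):

  import h5py
  import numpy as np

  data = np.random.rand(1_000_000, 5)
  with h5py.File("quotes.h5", "w") as f:
      # chunked, shuffled, lzf-compressed dataset: cheap to write, fast to read back
      f.create_dataset("quotes", data=data,
                       chunks=(65536, 5), compression="lzf", shuffle=True)

  with h5py.File("quotes.h5", "r") as f:
      arr = f["quotes"][:]  # one bulk read, no text parsing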


It looks like they are comparing (in the chart) throughput on each of their file formats in bytes, rather than records. So, they are slower than HDF5, but eat up more resources doing so (which I think is meant to suggest they are leaving less to waste).


Unrelated, but holy crap is that typeface color (#79888e) difficult to read.


I wish there were more details on the benchmark. For example, I am interested to know whether schema inference was turned on for spark-csv.


The benchmarking scripts are in their repo; this line seems to answer your question ("inferschema" was turned on, probably to make it an apples-to-apples comparison, since static schema support is still on their TODO): https://github.com/wiseio/paratext/blob/ca347a552a53595b680c...


Not MIT licensed



If I'm thinking about adding a third-party library to my code base, I don't want to have to ask my lawyer to go through the license, because then I get a bill in the mail. It's that simple.


The APL (Apache License) is a common license. It would serve you well to learn about it instead of talking in hypotheticals and expecting everything to be MIT.


Well, I don't really see why you can't add an Apache 2 library to any project. The sole limitation is that you must publish back any change you made to that specific library under the same license.

Apple, for instance, uses Apache 2 for Swift, and they clearly intend for closed-source projects to use it as a library. So what's the matter?


it's all good. i'm sure they won't mind if you don't use it.



