"A fast reader exploits the capabilities of the storage system"...
the graphs show that their storage system is doing 4.00 GB/sec
I wonder what processor this is running on and what their storage system is... multiple PCIe SSDs?
I tried running a quick test but only succeeded in OOMing my 8G laptop.
Even just doing
import paratext
it = paratext.load_csv_as_iterator("/dev/shm/tmp/c.log", expand=True, forget=True)
x = it.next()
Starts eating up all my RAM after about a minute of spinning the CPU... so I think they have a slightly different definition of an iterator than everyone else.
Compared to
cut -d , -f 5 < c.log > /dev/null
which runs in a few seconds, or a slightly more domain-specific and optimized version of 'cut'[1] that runs even faster (300-500 MB/s on a single core, depending on which fields you want). I also wonder if that is 2.5 GB/s per core.
https://github.com/BurntSushi/rust-csv does 241 MB/s in raw mode, so I find it a little hard to believe that this is 10x faster... unless that is while maxing out multiple cores.
Does cut work with quoted fields correctly? My understanding was that it was just a dumb line tokenizer and CSV is a little more complex than that, e.g., CSV rows can span more than one line when fields contain line breaks.
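A quick way to see the difference: Python's stdlib csv module handles a quoted field containing a newline, while naive line splitting (which is essentially what cut does) breaks the record in two:

```python
import csv
import io

# A single CSV record whose second field contains an embedded newline.
data = 'a,"hello\nworld",c\r\n'

# A real CSV parser yields one record with three fields.
rows = list(csv.reader(io.StringIO(data)))
print(rows)   # [['a', 'hello\nworld', 'c']]

# Naive line-based splitting sees two "lines" and mangles the quoted field.
naive = [line.split(",") for line in data.strip().split("\n")]
print(naive)  # [['a', '"hello'], ['world"', 'c']]
```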
Which means an effective thing to do when you control the data pipeline is to ban tabs and line-breaks from your values, and then use `cut` on tab-separated files.
Sure, if you can redefine your data format, you can take shortcuts. But otherwise comparing cut to a CSV parser is just a different way of saying »if it doesn't have to be correct, I can make it as fast as I want«.
> Starts eating up all my ram after about a minute of spinning the cpu... so I think they have a slightly different definition of an iterator as everyone else.
I don't think so, the difference is you're expecting it to read one row at a time and it's actually reading one column at a time. load_csv_as_iterator is iterating over the columns.
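A toy contrast of the two iteration styles (not ParaText's actual implementation): a row iterator can stream, but a column iterator over a row-oriented file has to materialize everything before it can yield the first column, which would explain the memory blow-up on a big file:

```python
import csv
import io

data = "a,b\n1,2\n3,4\n"

# Row-wise: yields one record at a time; memory stays bounded.
def iter_rows(f):
    yield from csv.reader(f)

# Column-wise over a row-oriented file: must read the whole file
# before the first column can be produced.
def iter_columns(f):
    rows = list(csv.reader(f))   # entire file in memory
    for col in zip(*rows):
        yield list(col)

first_row = next(iter_rows(io.StringIO(data)))
first_col = next(iter_columns(io.StringIO(data)))
print(first_row)  # ['a', 'b']
print(first_col)  # ['a', '1', '3']
```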
It really makes me sad that CSV even exists: ASCII defines field ('unit') & record separator characters (also group & file, but those are less-useful), as well as an escape character. With those few characters, all of the mess of CSV encoding could be solved with these few rules:
- all records are separated by an RS character (#x1e)
- all fields within a file are separated by a US character (#x1f)
- all instances of RS, US & ESC within a field are prefixed with an ESC (#x1b)
- there are no more rules
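Those few rules fit in a few lines; a minimal round-trip sketch (hypothetical `encode`/`decode` helpers, not any existing library):

```python
RS, US, ESC = "\x1e", "\x1f", "\x1b"  # record sep, unit sep, escape

def encode(records):
    # Escape ESC first so the escapes we add for RS/US aren't re-escaped.
    def esc(field):
        for ch in (ESC, RS, US):
            field = field.replace(ch, ESC + ch)
        return field
    return RS.join(US.join(esc(f) for f in rec) for rec in records)

def decode(data):
    records, rec, field = [], [], []
    chars = iter(data)
    for ch in chars:
        if ch == ESC:
            field.append(next(chars))          # next char is literal
        elif ch == US:
            rec.append("".join(field)); field = []
        elif ch == RS:
            rec.append("".join(field)); field = []
            records.append(rec); rec = []
        else:
            field.append(ch)
    rec.append("".join(field))                 # flush final field/record
    records.append(rec)
    return records

# Commas and newlines need no escaping at all; only RS/US/ESC do.
table = [["a", "b,c\nd"], ["has a \x1f inside", "plain"]]
print(decode(encode(table)) == table)  # True
```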
It's remarkable to me that ASCII defines a pretty full-featured mechanism for information interchange (start of header, file transfer &c.) and instead we continue to build mechanisms atop its alphabetic characters.
It's like the original sin of computing is Not Invented Here (no doubt someone will pipe up with a story of how ASCII itself was the product of NIH!).
I used to religiously believe in this, but in practice it isn't useful. The whole point is to be roughly human-readable and using non-printing characters defeats that. You can't even easily enter these things via the command line.
If we're abandoning human-readability, why even bother with ASCII? Just use a binary format. Has anyone actually used ASCII unit and record separator delimiters successfully? I'd be curious about what advantages they had over a binary format, even just a protobuf or Thrift serialized form. If we want to preserve schemalessness, there's stuff like Sereal.
> Has anyone actually used ASCII unit and record separator delimiters successfully?
At my first job, the (proprietary) database server we used had a communication protocol that used the ASCII STX/ETX/EOT/ACK/ETB/GS/RS/US characters. When I looked up those ASCII codes I was like "Oh, that's clever", but in practice, it was just another proprietary binary format, and had all the problems of proprietary binary formats. The control codes still needed to be escaped in source code; they still weren't visible in log files or when viewing console output. They were basically just bytes we had to send to implement the protocol.
> The whole point is to be roughly human-readable and using non-printing characters defeats that.
You're assuming that separator characters are not human-readable. If ASCII had been used as originally intended, they'd be just as readable as line breaks. Err, carriage returns.
Most people build systems to work in the world we have, not the world we wish we had.
There are all sorts of examples like this. UNIX was initially designed to work with pipes of line-oriented streams, so why did we get scripting languages where every UNIX command is reinvented as a function call or RPC frameworks where the pipe is replaced by a binary message? PHP was initially a templating language, so why did we get Smarty, Wordpress, and PEAR templates? The web was supposed to come with full support for editing & creating pages via WYSIWYG editor and have a built in mechanism (hyperlinks) for associating pages with other people, so why did we need Facebook to introduce the idea of "sharing content with other people".
In each case, there were real, pragmatic reasons that people invented new systems instead of doing what they were "supposed" to do. Line-oriented files are clumsy for representing hierarchical data or conditionals. PHP is too hard to use for most end-users, despite being built for pragmatic "just toss a webpage up" use. The editing features in the web disappeared early on, with Netscape, and they needed the critical mass of college students that Facebook provided before people felt they had an audience for anything they did.
The moral for system designers is that you can't just throw a feature out there and say "Use this." You have to adapt it to how people actually do use it, even if that usage seems brain-dead to you.
One of the problems with this mechanism is data entry. It requires specialized software to insert an RS character instead of a newline ('\n') when going to the next row, for example.
After it has been saved in that format, it is a text file consisting of one very long line (unless some of the fields have newlines, but that doesn't represent the actual structure). This makes it inconvenient to browse with a text editor or pager. Sure, you could have a special mode to make them show up as newlines, but then you can't tell the difference between real newlines and newlines inserted for convenience.
Consequently, Unix text-processing tools like cut, sed, and awk use a line feed ('\n') as a record delimiter, unless there is a special reason not to (e.g. the -print0 flag on find).
I regularly use non-printable ASCII characters as delimiters. I just confirm first that they are not in the input. I am going to start using the ones you mentioned. This solution is one I rarely if ever see used. The other solution which I have seen in practice is to encode the comma character. The problem really IMO is that people who make CSV place no limits on what range of characters is stored in the tables. Certain characters can safely be removed. In most cases the input is still comprehensible without commas. Commas can be deleted, freeing up their use exclusively as record separators.
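The confirm-then-use approach can be made mechanical; a small sketch with a hypothetical `join_safely` helper that does no escaping and simply refuses input containing the separators:

```python
US, RS = "\x1f", "\x1e"  # unit separator, record separator

def join_safely(rows, field_sep=US, record_sep=RS):
    """Join fields/records with non-printable separators, refusing
    any input that already contains them (no escaping performed)."""
    for row in rows:
        for field in row:
            if field_sep in field or record_sep in field:
                raise ValueError("separator occurs in data: %r" % field)
    return record_sep.join(field_sep.join(row) for row in rows)

out = join_safely([["a", "b"], ["c,d", "e"]])
print(repr(out))  # 'a\x1fb\x1ec,d\x1fe'
```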
The main advantage of CSV is that it is human-readable. But if you have this much data, then why not use some binary format that you can just read in as a blob, and don't even need to parse?
Also, the approach mentioned here is only useful if reading+parsing the data is the bottleneck (or close to it). For example, if reading+parsing takes only 10% of the total processing time, then optimizing this stage will only give at most a 11% increase in performance.
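The 11% bound is just Amdahl's law; a quick check:

```python
# Amdahl's law: overall speedup when a fraction p of runtime
# is accelerated by a factor s.
def overall_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# If parsing is 10% of runtime, even an infinitely fast parser
# caps the total gain at 1/(1 - 0.10), i.e. about 11%.
limit = 1.0 / (1.0 - 0.10)
gain = round((limit - 1.0) * 100, 1)
print(gain)  # 11.1
```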
CSV is a ubiquitous data format, especially when you're receiving data from external sources you don't control, which is often the case in banking and finance (streaming CSV dumps are quite common).
CSV is a stupid format. XML, JSON, and YAML are easier to read and easier to parse/validate. After compression these formats take a similar amount of space, or even less when cross-referencing.
CSV, tab-delimited files, and flat files in general are a terrible idea in banking. I know because I work for a bank; these are the source of all misery.
Adding to what bladecatcher said, even if you want to work internally with a binary format, the fact that you receive large csv datadumps from the outside means you'll have to transform them into your internal format of choice. The first part of that transformation would greatly benefit from a fast csv parser such as ParaText.
Another data point: data changes (price updates, number ranges) between European telecom operators are all CSV files over FTP. You can recognize the security-aware by their use of SFTP!
"For example, if reading+parsing takes only 10% of the total processing time, then optimizing this stage will only give at most a 11% increase in performance."
10% performance increase is nothing to scoff at in a mature code base.
> Despite extensive use of distributed databases and filesystems in data-driven workflows, there remains a persistent need to rapidly read text files on single machines.
This comment makes me wonder how often a distributed approach is used out of some odd sense of convenience or interest, instead of taking the time to build an optimized single-node approach.
In other words, how often is a Hadoop-like system used when it's unnecessary, adding needless complexity?
The HDF5 benchmark looks bogus; it's not clear what exactly the author meant by it or how the data was stored/retrieved. With proper filters set up, HDF5 should beat any CSV reader in both (uncompressed) throughput and runtime.
It looks like they are comparing (in the chart) throughput on each of their file formats in bytes, rather than records. So, they are slower than HDF5, but eat up more resources doing so (which I think is meant to suggest they are leaving less to waste).
The benchmarking scripts are in their repo; this line seems to answer your question ("inferschema" was turned on, probably to make it an apples-to-apples comparison, since static schemas are still on their TODO): https://github.com/wiseio/paratext/blob/ca347a552a53595b680c...
If I'm thinking about adding a third party library to my code base I don't want to have to ask my lawyer to go through the license. Because then I get a bill in the mail. It's that simple.
Well I don't really see why you can't add an apache2 library to any project.
The sole limitation is that you must publish back any changes you make to that specific library under the same license.
Apple, for instance, uses Apache 2 for Swift, and they clearly aim for closed-source projects to use it as a library. So what's the matter?
"A fast reader exploits the capabilities of the storage system"...
the graphs show that their storage system is doing 4.00 GB/sec
I wonder what processor this is running on and what their storage system this is.. multiple PCIe SSD?
I tried running a quick test but only succeeded in OOMing my 8G laptop.
Even just doing
Starts eating up all my ram after about a minute of spinning the cpu... so I think they have a slightly different definition of an iterator as everyone else.Compared to
which runs in a few seconds, or a slightly more domain specific and optimized version of 'cut'[1] that runs even faster (300-500MB/s on a single core depending on which fields you want) I also wonder if that is 2.5 GB/s per core.https://github.com/BurntSushi/rust-csv does 241 MB/s in raw mode, so I find it a little hard to believe that this is 10x faster... unless that is while maxing out multiple cores.
[1] https://github.com/bro/bro-aux/blob/master/bro-cut/bro-cut.c