
KataFour, the one on data munging, makes me smile. I've had to do this so many times while wondering, "Is this hard just because I'm doing it the wrong way?" It's good to see it's considered worth a brainteaser exercise.

The weather.dat file in particular looks a lot like a vexing problem I've run into when converting PDF tables to text and then writing a script to delimit them. You can't just split on a regex, because in some rows certain fields are entirely blank.

I wrote a solution here for PDFs I had to munge for work: http://www.propublica.org/nerds/item/turning-pdfs-to-text-do...

Basically, you start with some regex pattern as a delimiter, loop through every line, and record the left-most position of each value. Then you compare those positions to a global array that stores the left-most position found so far in the document for each column. If the current row's column x starts farther to the left than the global array's column x, then there must be another column between x-1 and x.
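
To make that concrete, here's a rough Python sketch of the idea. It is not the code from the linked post, just a simplified variant: instead of comparing column x against column x directly, it keeps one sorted list of start offsets and inserts a new column whenever a value starts somewhere no known column does. The names find_column_starts, split_row, and tolerance are made up for illustration, and it assumes values contain no internal whitespace and that columns are roughly left-aligned.

```python
import bisect
import re

# Assumption: a "value" is any run of non-whitespace characters.
VALUE = re.compile(r"\S+")


def find_column_starts(lines, tolerance=1):
    """Return the sorted left-most start offset found for each column."""
    starts = []  # the "global array": one offset per column discovered so far
    for line in lines:
        for match in VALUE.finditer(line):
            pos = match.start()
            # First known column that starts at or after pos - tolerance.
            i = bisect.bisect_left(starts, pos - tolerance)
            if i < len(starts) and abs(starts[i] - pos) <= tolerance:
                # Same column as before: keep the left-most position.
                starts[i] = min(starts[i], pos)
            else:
                # No known column starts here, so a column was hiding
                # between its neighbours. Record it.
                starts.insert(i, pos)
    return starts


def split_row(line, starts, tolerance=1):
    """Split one line into fields, leaving "" where a field is blank."""
    row = [""] * len(starts)
    for match in VALUE.finditer(line):
        # Last column whose start is at or before this value (with slack).
        i = bisect.bisect_right(starts, match.start() + tolerance) - 1
        row[max(i, 0)] = match.group()
    return row
```

You'd run find_column_starts over the whole file first, then call split_row on each line with the result. Right-aligned numeric columns would need the same trick applied to end offsets instead of start offsets.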

There's some tinkering that has to be done with that, especially if the PDF-to-text conversion was ugly. But it would work pretty well for this particular set of data.


