PDF Miner

brandnewlow · on March 29, 2009

If anyone wants to use this for something public-service oriented:

Chicago is running for the 2016 Olympic games. About a month ago they released their official "bid book" in PDF form. The local papers gave it a look and wrote some fine stories, but a bunch of local journalists (myself among them) would like to extract the thing out into a Wiki so people could discuss and annotate it instead of just reading it in PDF form.

Link to the bid book: http://www.chicago2016.org/our-plan/bid-book/bid-book.aspx

We were thinking of using MediaWiki as the wiki engine. One of us is currently running (the excellent) Chicago Elections Wiki over at http://chicagoelections.pbwiki.com/

We'd host, promote, annotate and fill out the wiki. The important thing is to move this from a PDF to an interactive, scannable, hypertext format so people can tear it apart.

We'd been talking about sneaking into PyCon and asking around if anyone there would be interested in working on this. It looks like this PDF miner is the start of something that could do this.

dmv · on March 29, 2009

For a one-off use, writing code (and finding a coder) might be overkill. I have been impressed by the results from http://www.pdftoword.com, which renders even extremely weird PDF formatting to Word or RTF. You should be able to convert safely from there.

brandnewlow · on March 29, 2009

Thanks for the link. I agree with you there. I was thinking maybe a general PDF-to-MediaWiki/PBWiki script might be of use for this and other projects.
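Such a script could be fairly small once the text is out of the PDF. A rough sketch, assuming the text has already been extracted with page breaks marked by form feeds (which is what `pdftotext` emits) and that one wiki section per page is acceptable. The function name and the heading format are made up for illustration:

```python
def text_to_wikitext(text, title="Bid book"):
    """Turn form-feed-separated page text into MediaWiki markup."""
    pages = [page.strip() for page in text.split("\f")]
    sections = []
    for number, page in enumerate(pages, start=1):
        if not page:
            continue  # skip blank pages but keep the original page numbering
        # Blank-line-separated blocks become wiki paragraphs; line breaks
        # inside a block are collapsed to single spaces.
        paragraphs = [
            " ".join(block.split())
            for block in page.split("\n\n")
            if block.strip()
        ]
        heading = "== %s, page %d ==" % (title, number)
        sections.append(heading + "\n\n" + "\n\n".join(paragraphs))
    return "\n\n".join(sections)


sample = "First line\ncontinues here.\n\nSecond para.\fPage two.\f"
wikitext = text_to_wikitext(sample)
print(wikitext)
```

Tables, columns and images would still need hand cleanup; this only gets the running text into an editable shape.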

There's an open government hackathon this week at PyCon here in Chicago and it looks like at least three of the proposed ideas are PDF-to-Text apps.

http://feedback.sunlightfoundation.com/hackathon/

jacobolus · on March 30, 2009

> We'd been talking about sneaking into PyCon and asking around if anyone there would be interested in working on this. It looks like this PDF miner is the start of something that could do this.

Definitely go for it. I doubt the sprints will take any sneaking to get into. Go find the open government guys sprinting Monday/Tuesday, and I bet you could peel someone off to work on this.

patio11 · on March 30, 2009

> We'd host, promote, annotate and fill out the wiki. The important thing is to move this from a PDF to an interactive, scannable, hypertext format so people can tear it apart.

I love technological solutions, don't get me wrong, but we're talking about under 200ish pages, most of which have copy/pasteable text on them. Set up your Wiki and then head over to one of the freelancer sites, and you'll be done for under a few hundred bucks.

bd · on March 29, 2009

I used it recently for analysis of PDF articles.

It's quite good, though since it is written in pure Python, it's rather slow (especially compared to command-line tools written in C/C++).

I strongly recommend using Psyco [1]. Adding a few lines of code cut my PDF->HTML conversion times in half.
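For reference, the usual pattern is an import plus one call at the top of the script. A minimal sketch, assuming Psyco is installed; the `HAVE_PSYCO` flag is just an illustrative name, and the fallback keeps the script working where Psyco is missing:

```python
# Enable Psyco's specializing compiler if it is available.
# Psyco targets 32-bit x86 builds of Python 2, so fall back quietly elsewhere.
try:
    import psyco
    psyco.full()  # let Psyco compile every function it can handle
    HAVE_PSYCO = True
except ImportError:
    HAVE_PSYCO = False

print("Psyco enabled:", HAVE_PSYCO)
```

Put this before the conversion code runs so the hot loops are covered.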

Also, be warned that the markup it produces can be very heavy. Depending on how the PDF is structured, you can end up with a huge number of DOM elements.
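One way to gauge this before opening the result in a browser is to count the elements. A rough sketch using only the standard library; the sample string is made up and merely stands in for real converter output:

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_elements(html):
    """Return a Counter mapping tag name to number of elements."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up sample: one positioned span per text fragment.
sample = (
    '<div><span style="left:10px">Chi</span>'
    '<span style="left:31px">cago</span></div>'
)
counts = count_elements(sample)
print(sum(counts.values()), counts.most_common(3))
```

If one tag dominates the count, merging adjacent fragments before rendering is probably where the savings are.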

-----

[1] http://psyco.sourceforge.net/

latortuga · on March 30, 2009

For our startup we had a huge integration project with an industry-specific PDF, so I ended up writing a PDF importer that sounds similar to this project. The best part: I never figured out how to get my reader to determine which page a specific set of coordinates was on, and it looks like this library supports that. Thanks for the link!

jpcx01 · on March 29, 2009

Looks interesting. Any good Ruby alternatives?

draegtun · on March 30, 2009

Not sure. However, there is a well-established Perl one: http://search.cpan.org/dist/CAM-PDF/

albertsun · on March 29, 2009

Nice stuff. So many public documents are released in PDF format instead of an easy to work with plain text format.

mahmud · on March 29, 2009

Does anyone know if something like this exists for C? It would be nice to be able to call it from $LANGUAGE.

bd · on March 29, 2009

You could use the innards of an open source PDF viewer (most of them are written in C++).

I managed to get away with just using the command-line PDF->TXT tool that comes with Xpdf [1].
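Shelling out to it from a script is usually enough. A small sketch, assuming `pdftotext` is on your PATH; the helper names are made up for illustration, while `-layout`, `-f` and `-l` are the tool's options for keeping the physical layout and picking a page range:

```python
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None, layout=True):
    """Build the argument list; '-' as the output file means stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, **options):
    """Run pdftotext and return its raw output as bytes."""
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the tool fails
    )
    return result.stdout


# Only builds the command here; call pdf_to_text() on a real file to run it.
print(pdftotext_command("bid-book.pdf", first_page=2, last_page=3))
```

The output is left as bytes because the encoding depends on how the tool is configured.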

Also, MuPDF is awesome [2]. Its bare-bones demo PDF viewer replaced Foxit as my default PDF handler.

If you have a more serious budget, there is also PDFlib TET [3], a nice commercial solution with bindings for many languages (C, C++, Python, Ruby, Perl, .NET, and many others).

-----

[1] http://www.foolabs.com/xpdf/download.html

[2] http://ccxvii.net/fitz

[3] http://www.pdflib.com/en/products/tet/

visitor4rmindia · on March 30, 2009

VimTip: You can integrate Xpdf's pdftotext with Vim to view PDFs directly in your editor.

    autocmd BufReadPre *.pdf set ro
    autocmd BufReadPost *.pdf silent %!pdftotext "%" -

latortuga · on March 30, 2009

Unencrypted PDF files can be opened in any editor; they are plain text.
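One caveat: the object structure of a PDF is readable text, but page content usually sits in compressed (FlateDecode) streams, so the words themselves often won't show up in an editor. A self-contained sketch that builds a single stream object by hand (not a complete PDF, and the sample text is made up) to show the difference:

```python
import zlib

# Page content written as PDF text operators; the sample text is made up.
content = (
    b"BT /F1 12 Tf 72 720 Td (Hello from page one) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td (Hello from line two) Tj ET\n"
)
compressed = zlib.compress(content)

# The stream object roughly as it would be stored in a file.
pdf_object = (
    b"4 0 obj\n<< /Length " + str(len(compressed)).encode("ascii")
    + b" /Filter /FlateDecode >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)

# Structure keywords are visible as-is in an editor...
structure_visible = b"/FlateDecode" in pdf_object
# ...but the page text is not, until the stream is inflated.
text_visible = b"page one" in pdf_object

start = pdf_object.index(b"stream\n") + len(b"stream\n")
end = pdf_object.rindex(b"\nendstream")
recovered = zlib.decompress(pdf_object[start:end])

print(structure_visible, text_visible, recovered == content)
```

This is the step that a tool like pdftotext does for you before it even starts working out where the text goes.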

sketerpot · on March 30, 2009

AAAAAAGHFGUREH!!! I had to write my own a few months ago, which sucked. If I had known about this, I could have been saved a lot of effort. Noooooo!

Technology moves forward, I see.