Vertical search engines that scrape a large number of sites already do something like this, and some may already have the tools described in the article. That said, most use regular expressions instead of XPath because of malformed markup; CSS selectors are another option.
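To illustrate why regexes survive where XPath-style parsing doesn't, here's a minimal sketch (the markup and pattern are invented for the example): a strict parser rejects the unclosed tags outright, while a regex keyed on an attribute still pulls the values out.

```python
import re
import xml.etree.ElementTree as ET

# Malformed markup: the unclosed <li> tags are common in the wild.
html = """
<ul>
  <li class="price">$19.99
  <li class="price">$4.50
</ul>
"""

# An XPath query needs a parsed tree first, and a strict parser
# refuses to build one from this input.
try:
    ET.fromstring(html)
except ET.ParseError:
    print("strict parser choked on the markup")

# A regular expression doesn't care about well-formedness.
prices = re.findall(r'<li class="price">\s*(\$[\d.]+)', html)
print(prices)  # ['$19.99', '$4.50']
```

In practice a lenient HTML parser (the kind browsers use) splits the difference, which is why CSS-selector libraries built on such parsers are a workable middle ground.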
In my opinion, the other side of the problem with uptake of the semantic web is that the tools and formats used to describe and access the data are rather heavyweight. It would be nice if there were a simple way to define new data types as well as store and access the data. Perhaps something like Google Base with a bit of server-side JavaScript for scraping thrown in?
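For illustration, here is one hypothetical shape such a lightweight system could take (the type definition, `insert` helper, and in-memory store are all invented for this sketch, not an existing product): a "type" is just a named field list, and records are checked against it on insert.

```python
import json

# Hypothetical lightweight data type: nothing but a name and a field list.
recipe_type = {"name": "recipe", "fields": ["title", "cook_minutes"]}

store = []  # stand-in for whatever backend would actually persist records

def insert(record_type, record):
    """Accept a record only if it supplies every field the type declares."""
    missing = [f for f in record_type["fields"] if f not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    store.append({"type": record_type["name"], "data": record})

insert(recipe_type, {"title": "Pancakes", "cook_minutes": 15})
print(json.dumps(store))
```

The point of the sketch is the low ceiling: declaring a type and storing data against it takes a few lines, with no ontology or schema language in sight.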
We're working on something remarkably close to what the article describes, but we're not taking the search engine approach you allude to. Rather, we're trying to make it a useful multi-tool close in spirit to systems like Pachube: a component that coders, bloggers, site authors, or anyone else can drop into their projects.
And you're right: semantic formats are very heavyweight. There are a lot of useful things that can be done with semi-semantic data before we achieve full linked data across the web, if that ever happens (you could argue for and against such visions, IMHO). Check it out at http://scrapmetl.com/ and give us a shout on Twitter (@Maciek416 and @corban) if you're interested in playing.