I needed to geocode tens to hundreds of thousands of US addresses weekly, and could accept slightly reduced accuracy compared to the parcel-level accuracy of Google Maps.
I rewrote the geocommons geocoder in Java to speed up the loading and geocoding process, and wrapped a REST API around it. I used a minimal perfect hash function to map ZIPs/streets (Metaphone3'd and n-gram-fingerprint'd) to data stored in a key-value structure. The key-value structure is small enough to fit in memory on a decent-sized EC2 instance, but I haven't tested throughput except from a slow disk--which got me about 100-150 results/sec.
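For flavor, the n-gram fingerprint half of that key normalization looks roughly like this (a minimal sketch using a bigram fingerprint in the OpenRefine style; the real pipeline also runs Metaphone3, and the keys feed a minimal perfect hash rather than a plain map):

```java
import java.util.*;

public class NgramFingerprint {
    // Bigram fingerprint: lowercase, strip everything but letters/digits,
    // collect the unique 2-grams, sort them, and join. Spelling and
    // spacing variants of a street name collapse to the same key.
    public static String fingerprint(String s) {
        String clean = s.toLowerCase().replaceAll("[^a-z0-9]", "");
        SortedSet<String> grams = new TreeSet<>();
        for (int i = 0; i + 2 <= clean.length(); i++) {
            grams.add(clean.substring(i, i + 2));
        }
        return String.join("", grams);
    }

    public static void main(String[] args) {
        System.out.println(fingerprint("Main St."));  // "aiinmansst"
        System.out.println(fingerprint("MAIN  ST"));  // same key: "aiinmansst"
    }
}
```

The point of the sorted-unique-grams step is that punctuation, casing, and minor transpositions all wash out before the key ever hits the hash function.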
The results include the parsed address, lat/lng in the WGS84 datum, and associated US census region info (state, county, block group, block, MSA, CBSA/CSA, school district, legislative district, etc.).
I'd considered open sourcing it, and I was trying to architect it such that one could plug in data sources beyond TIGER when higher-accuracy info is available (e.g., SF's address parcels, or the E911 parcel data Massachusetts makes available).
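The pluggable-source idea could be sketched as an interface plus an accuracy-ordered dispatch (all names here are illustrative, not from the actual codebase):

```java
import java.util.*;

public class PluggableGeocoder {
    // Hypothetical plugin point: each source (TIGER interpolation,
    // SF parcels, MA E911 parcels, ...) implements lookup(), and the
    // dispatcher returns the hit from the most accurate source.
    interface GeocodeSource {
        int accuracy();                          // higher = better (parcel > interpolated)
        Optional<double[]> lookup(String addr);  // {lat, lng} in WGS84, empty if no match
    }

    static Optional<double[]> geocode(List<GeocodeSource> sources, String addr) {
        return sources.stream()
                .sorted(Comparator.comparingInt(GeocodeSource::accuracy).reversed())
                .map(s -> s.lookup(addr))
                .filter(Optional::isPresent)
                .map(Optional::get)
                .findFirst();
    }

    public static void main(String[] args) {
        // Dummy sources with made-up coordinates, just to show dispatch order.
        GeocodeSource tiger = new GeocodeSource() {
            public int accuracy() { return 1; }
            public Optional<double[]> lookup(String a) {
                return Optional.of(new double[]{37.77, -122.41});
            }
        };
        GeocodeSource sfParcels = new GeocodeSource() {
            public int accuracy() { return 2; }
            public Optional<double[]> lookup(String a) {
                return a.contains("san francisco")
                        ? Optional.of(new double[]{37.7749, -122.4194})
                        : Optional.empty();
            }
        };
        double[] hit = geocode(List.of(tiger, sfParcels),
                               "123 main st san francisco ca").get();
        System.out.println(Arrays.toString(hit));  // parcel-level source wins
    }
}
```

The nice property is that TIGER stays as the universal fallback: a regional parcel source only has to answer for addresses it actually covers.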