If you're dealing with international addresses, point 3 becomes very challenging. The 'tokens' of an address are called different things everywhere, take different forms, and sometimes don't make much sense to compare. Figuring out a good balance of usability and generality can be really tricky.
I think programmers over complicate things too much when it comes to addresses. When you're dealing with a global system that is able to deliver a package to "the 3rd house on the right, past the pond, with the red fence," or whatever, in rural anywhere, you probably should just treat it like we all do in the real world, as just a string to be interpreted in context.
This fails as soon as you talk to an API that requires normalization, need to aggregate your data, or, really, do anything other than try to deliver it.
Lmao, you're really going to hate me when I tell you I don't think we should be using addresses for anything other than delivering physical things. I know it's a cop out answer. But I don't have a better one.
> Lmao, you're really going to hate me when I tell you I don't think we should be using addresses for anything other than delivering physical things
This is not always an option. For example, in some jurisdictions, you need to calculate, charge, and remit sales taxes based on client location, even if you are selling digital goods.
However these locational jurisdiction details will be unique (i.e. a state, a country, a city, a trading block, a tax code etc).
These can be different to the address details which is nothing more than a postal address to where the goods/letters need to be posted.
In other words the former details should be explicitly provide in a fault tolerant manner (i.e. drop down lists) and not parsed from the details found in the address.
USA mailing addresses are strictly that -- an identifier for the convenience of the Post Office which may or may not correspond to the underlying jurisdictions for things like political boundaries, tax authorities, etc.
I learned this when setting up voter databases. A house that is in a given city/precinct/etc (the atomic identity is parcel number) may have a mailing address in a completely different city. There are many such cases for Los Gatos vs Saratoga, for example. Also, the USPS considers every address to be in some "city" even on unincorporated parcels.
Back when phone area codes and prefixes had precise geographical meanings, many edge cases ended up on the "wrong" area code (e.g. Sunnyvale numbers being in 415/650 instead of 408 between the 85 freeway and the Sunnyvale border).
The local taxes are rarely based on the street address and typically coincides with a larger municipal division.
The problem in that case is trying to use the “address” for two purposes which may not be aligned, getting goods to the buyer and trying to find their tax domicile.
local taxes are rarely based on the street address
No, they're based on the underlying geographical boundaries based on the parcel number. This usually corresponds to the city in the mailing address (in contiguously Incorporated lots) but not always.
On the e-commerce platform I work on we need to send customer data including addresses to multiple vendor API's. You can't just design your db schema in isolation usually.
Some counter-examples include Route Planning or determining where to locate a Consolidating Freight Service. If your addresses are not normalized, you lose the ability to see two addresses listed as "from city center, three lefts" and "from city center, one right" take you to neighboring locations.
These are not counter examples, they are examples of the types of problems that I think turn out very poorly when they rely on normalized addresses. What I'm saying is we shouldn't be using addresses for things like this. I realize I'm talking about, basically, upending several entire industries or something, so I know it's unreasonable, pragmatically.
I just think that the hard problem of 100% accurate address normalization suffers from an extremely fat tail[0] of edge case issues, and becomes economically unviable to solve, very quickly.
95% of the problem is having an up to date record of zip/postcodes. As long as you treat the postcode as a separate field, the rest of the address can be handled by the courier. Most couriers will go off the postcode to determine whether or not they can deliver a package and cost the delivery - even if the rest of the address is garbage.
IMO, A database schema for addresses should consist of: (Country, Postal code, Address). I would split Address into 4 strings and let the user fill out whatever they want in them, in whatever order. I'd also suggest that is isn't even necessary to validate this information on your input forms - but the validation is best performed by contacting the courier with the information input by the user when requesting a delivery cost. (And if this validation fails, email the user back). One suggestion would be to forbid commas in any address fields because CSV formats are accepted by some couriers with varying degrees of support for quoted fields, and often end up requiring manual intervention.
Unless you are the courier, you should probably not waste your resources on attempting to normalize any other parts of the address.
This is from experience in a business which ships hundreds of international packages a day, and tens of thousands domestically (UK).
People usually solve that problem by learning their address according to Google Maps (the most widely used service). For most of us it's the same address we always use. For somebody it's what they write when they want people or stuff to reach their home.
Why not just store GPS coordinates and do geospatial queries? which gives you nice features like "show me stuff within X kilometres of a shop" and "is cutomer X within bounding box Y?".
That might help to some extent depending on the use case, but the tectonic plates are moving all the time, and shifting a lot in some places where you have geological events like earthquakes. A set of GPS coordinates may no longer point to your current (moved) location accurately in that case.
this is going to be quite random but I recently saw a comment you made about deadlines (https://news.ycombinator.com/item?id=19361199) and am interested in some of your thoughts on what an ideal software development methodology looks like, if you care to talk about it please throw me an email smtheard@gmail.com
I’ve been through this before: storing a couple addresses for a few hundred different countries in an otherwise very normalized RDBMS. We couldn’t figure out anyway to do it other than EAV tables and a defined attribute pattern for each country.
Even though that project is long in the past for me, I’m really curious how other people do this. Anyone come up with something better?
At a past job, we basically did "both" - we'd store our best guess at a normalized address, and we'd store the text representation. For stuff like shipping labels, we'd use the text. For analytics, we'd lean on our best-guess normalization (and understand that there's some potentially significant error).
You can also lean on various APIs to normalize, but nothing that I'm aware of does this well on a global scale.
At that point you might as well just make an "Addresses" or "Address" table with a raw text field for the address. At the end of the day this thing is something that a human will read and map to the real world location.
> At the end of the day this thing is something that a human will read and map to the real world location.
That is quite an assumption.
I personally would use a field with a structured format e.g. XML (most RDBMS systems can deal with XML easily) and different types of address field e.g. postcode, zip code etc.
Then you could use a strategy for each country when dealing with it in your application.
We went the EAV route instead of just a free form text field because we still wanted to expose discrete fields to in the UI and have some front-end validation. But sure, from a data integrity standpoint it’s no better than a text field.
> Even though that project is long in the past for me, I’m really curious how other people do this. Anyone come up with something better?
I've always gone the other way. Bland fields (Lines 1-4, city, state, postcode, country) or - when I didn't need to validate it, a textarea for free form entry.
Most of the time validation has been required by the shipping provider, so freeform hasn't been possible. As the shipping providers haven't got anything 'better', we stick to their format and field lengths (which are terrible for out-of-country addresses)
If you are mailing something, using a plain formatted address string is usually fine.
Here's a harder problem from a previous job:
Given two addresses, tell me if they both point to the same "place".
(As you might imagine, the goal of this was joining various datasets.)
There really is no good way to do this, but to come close, you might consider a few dimensions of the problem:
1. Formatted address - a plain string. What you would write on an envelope.
2. Tokens (what I call "address components") - the bits of information that make up an address. Something like, {"street_number": "10", "street": "downing st", "locality": "london", ...}. You can get these back from most location-focused APIs.
3. Locations. Coordinates of a point or area in space.
4. Time. "Places" change over time. For example, a new business can open in place of another. Buildings are built and demolished. If you are working with data over a long time period, this is more of a concern.
Unfortunately, there are many-to-many relationships between all of these. For any address, there are many possible locations. For any location, there are many possible addresses. For any set of address components, there are many possible formatted addresses. For any address, there may have been multiple different "places" found there over time. Et cetera.
Consider an office building, for an example. That building probably has at least one address corresponding to its main entrance, but it might have other entrances. An office within the building can sometimes use the building's main address as its own address, or the address of its own entrance, or an address with a floor number, or an address with a mailroom number. An office building may have a café (usually off of the lobby) that has its own addresses. Offices in the building can be split and merged over time. I have even seen multiple buildings merged into one. You might also have a co-working space with individual addresses for rooms within it.
My logic with the address field is to try to consider what is the purpose of the address in the application.
If it is used for someone to send you a letter in the mail (for example: university admission) then you are better off with a big box where the user can write the magic blob of text that usually gets their mail delivered to the right place, up to instructions to the local postman.
If you need to pull off some analytics, or need to estimate shipping, then just extract the fields that your shipping agent demands: country, zip code etc. Leave all else intact in the "magic blob".
I think city, postal code, state/province (optional), and country would be fairly standard. At least in North America, Japan, Australia, and the UK. Of course for country you would want to use standardized country codes, which are somewhat political.
Here's nearly 100 pages on the subject (from the perspective of addressing mail for USPS): http://www.columbia.edu/~fdc/postal/