Honest question from somebody who doesn't know any better: with the data from the HGP available for quite some time now, it doesn't appear to me (as a layperson) to have had the promised impact of suddenly providing a genetic map that will allow us to quickly find and target genetic diseases and other undesirable traits. Would anybody knowledgeable in the field be able to provide some insight into what kind of impact the HGP data has had?
In terms of science research, the impact has been unparalleled. It's hard to find a molecular biology paper that doesn't owe a ton to having the full human genome sequence. There have also been technology side effects. Just like defense projects fueled the early market for silicon-based semiconductors, the HGP kick-started a market for sequencing machines that is generating a huge revolution in sequencing technology where costs are now falling 5x-10x per year, which is quite a bit faster than Moore's law.
In terms of finding genetic causes of disease, this happens every day and is practically mundane, but there are two complications in getting to cures. First, most disease is far, far more complicated than a single gene; any single gene may account for just a percent or two of what we call the same disease. Second, knowing which gene is broken does not provide a cure for that disease; even for a given small molecule, determining whether it will interact with a gene's protein or have any effect on that protein's structure or function is a task that physics has not been able to tackle. Additionally, the genome has only been available for a mere decade, and for many if not most diseases, the process of going from a known gene target to an approved drug is going to take far longer than 10 years.
So the HGP has fueled a huge amount of discovery, is the foundation of nearly all human biology research, and is completely indispensable. In terms of new cures for various diseases it has not delivered yet, but really it shouldn't be expected to have.
So I guess my question is (and I've seen lots of articles and breakthroughs in genome sequencing): what use has the actual HGP data been? It seems to me that things like this are about as based on the HGP data as velcro is on NASA. It's an enormously beneficial spinout technology that happened to develop as a side benefit of the main work. I don't know if I'd go so far as to say the sequencing or velcro would never have been developed without the main research foci, but it didn't hurt.
The HGP reference genome is pretty much essential to the "whole genome" analysis done on humans and that's the big direction in research right now. I work in cancer and disease genomics doing data analysis software and all of the analysis methodology goes back to this reference in some way.
Sequencing technology has gotten to a point where it's just blown Moore's Law absolutely out of the water and we can't throw more compute at the analysis problem, we have to make it smarter. The reference genome is used in how that's been made smarter.
It helps to discuss a little bit about how the HGP reference was produced, and why producing it took 10 years and three billion dollars.
The HGP process first had a map made, where the genome was broken into lots of smaller segments. The idea was that this reduced your problem space; any segment of DNA produced from a sample from that portion of the genome came from that area. Then that segment was broken into lots and lots of smaller chunks and then read on the sequencing machines in 600-800 base segments. By the time that sequencing technology reached "max level," the state of the art machine could generate 96 of those segments in an hour's time.
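To put those throughput numbers in perspective, here is a rough back-of-the-envelope sketch. The 96 reads per hour and the 600-800 base read lengths come from the description above; the genome size of roughly 3 billion bases is my own round-number assumption, and the calculation ignores overlap between reads, failed runs, and the redundancy a real project needs.

```python
GENOME_BASES = 3_000_000_000  # assumption: roughly 3 billion bases in a human genome
READS_PER_HOUR = 96           # per machine, from the description above
READ_LENGTHS = (600, 800)     # bases per read, from the description above


def machine_hours_for_one_pass(read_length,
                               genome_bases=GENOME_BASES,
                               reads_per_hour=READS_PER_HOUR):
    """Hours for one machine to read every base once (1x coverage).

    Idealized: assumes no overlap between reads and no failed runs.
    """
    return genome_bases / (reads_per_hour * read_length)


for length in READ_LENGTHS:
    hours = machine_hours_for_one_pass(length)
    years = hours / (24 * 365)
    print(f"{length} bp reads: {hours:,.0f} machine-hours "
          f"(about {years:.1f} machine-years) per 1x pass")
```

Even under these generous assumptions, a single pass over the genome costs years of machine time, and a usable assembly needs many passes, which is part of why the project needed rooms full of machines.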
Then you'd calculate overlaps and assemble those smaller "reads" back into a sequence of that chunk you chose from the map. Then someone would audit the computer-generated assembly by hand, possibly ordering up more lab work to fill any gaps or resolve areas of crummy data. Repeat for the next chunk from the map.
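A toy version of the "calculate overlaps and assemble" step might look like the following. This is a minimal greedy sketch, not the actual HGP assembler: the function names and the `min_len` threshold are made up for illustration, it assumes error-free reads all on the same strand, and it does not handle repeats or reads contained inside other reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # a gap: nothing overlaps, so the pieces stay separate
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
print(greedy_assemble(["AAAA", "CCCC"]))
```

When the loop stops with more than one piece left, that is the toy analogue of a gap that someone would have to close with more lab work.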
Now here's how things work when we need to do any sort of genomic analysis on an individual:
New technology has the ability to sequence human genomes at deep coverage in 11 days[1], and cranks out 6 billion reads 100bp long from places all over the genome. Computationally, this is an absolutely different animal. You can't feasibly try to re-assemble these reads into a human. So, what we do is use string matching algorithms to "map" a 100bp read back to where it most likely came from, using the HGP genome as a reference.[2] Since obviously your DNA does not match the HGP reference base-for-base, and mismatches/insertions+deletions are really where the interesting data is anyway, there's some leeway for mismatches in the mapping.
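To make "map a read back with some leeway for mismatches" concrete, here is a deliberately naive sketch. It slides the read along the reference and keeps the placement with the fewest mismatched bases. The function name and the `max_mismatches` parameter are hypothetical, it only handles substitutions (no insertions or deletions), and a brute-force scan like this would be hopeless at 6 billion reads against a 3-billion-base reference; real mappers rely on a precomputed index of the reference.

```python
def map_read(read, reference, max_mismatches=2):
    """Find the best gapless placement of read on reference.

    Returns (position, mismatches) with a 0-based position, or None if
    no placement has at most max_mismatches mismatched bases.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mismatches = 0
        for r, g in zip(read, reference[pos:pos + len(read)]):
            if r != g:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too different, give up on this position
        else:
            # Loop finished without exceeding the limit: a candidate hit.
            if best is None or mismatches < best[1]:
                best = (pos, mismatches)
    return best


reference = "ACGTTAGCCGATAGGCTTAACG"
print(map_read("CCGATTGG", reference))  # differs from the reference at one base
print(map_read("TTTTTTTT", reference))  # nothing close enough
```

The mismatch that survives the mapping is exactly the interesting part: it is a candidate difference between the individual and the reference.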
At that point, by mapping reads back to where they came from, we end up with a data file that represents an individual's genome. You're able to walk across the genome base for base and ask "So, base 347 of Chromosome 7 is a T in the reference, what is the most likely base on Joe's genome at this point given the reads we have that span this base?"
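The "what is the most likely base at this position" question can be sketched as a simple vote over the reads covering that position. This is only an illustration under strong assumptions: reads are given as hypothetical (start, sequence) pairs with 0-based starts, alignments are gapless, and a plain majority vote stands in for the probabilistic models real tools use.

```python
from collections import Counter


def call_base(mapped_reads, position):
    """Majority-vote the base at a 0-based reference position.

    mapped_reads is an iterable of (start, sequence) pairs.
    Returns (base, supporting_reads, depth), or None if no read
    covers the position. Ties go to the base seen first.
    """
    counts = Counter(
        seq[position - start]
        for start, seq in mapped_reads
        if start <= position < start + len(seq)
    )
    if not counts:
        return None
    base, support = counts.most_common(1)[0]
    return base, support, sum(counts.values())


# Four made-up reads that all span position 4; one of them disagrees.
joes_reads = [(0, "ACGTAC"), (2, "GTACGG"), (3, "TTCGGA"), (4, "ACGGAT")]
print(call_base(joes_reads, 4))
print(call_base(joes_reads, 100))
```

Comparing the called base with the reference base at the same position is what turns a pile of reads into a list of differences for one person.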
Mapping things to the reference also allows us to attempt to find really interesting stuff that can cause disease, such as structural variations in the genome. These are instances where large segments are removed, duplicated, inverted, or picked up and moved somewhere else relative to where they "should be."