
The basic assumption is that most of the genome is essentially random. If you take a short substring from an arbitrary location, it will likely define the location uniquely. Then there are some regions with varying degrees of repetitiveness that require increasingly arcane heuristics to deal with.
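
You can check this assumption directly by counting k-mer multiplicities: in a human-sized genome, most 21-mers occur exactly once. A toy sketch in Python (k=21 and the function name are my own choices):

    from collections import Counter

    def kmer_uniqueness(genome, k=21):
        # Fraction of distinct k-mers that occur exactly once in the genome.
        counts = Counter(genome[i:i+k] for i in range(len(genome) - k + 1))
        return sum(1 for c in counts.values() if c == 1) / len(counts)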

There are two basic approaches: reference-based and de novo assembly. In reference-based assembly, you already have a reference genome that should be similar to the sequenced genome. You map the reads to the reference and then call variants to determine how the sequenced genome is different from the reference. In de novo assembly, you don't have a reference or you choose to ignore it, so you assemble the genome from the reads without any reference to guide (and bias) you.

Read mapping starts by using a text index to find seeds: fixed-length or variable-length exact matches between the read and the reference. Then, depending on seed length and read length, you may use the seeds directly or try to combine them into groups that likely correspond to the same alignment. With short reads, it may be enough to cluster the seeds based on distances in the reference. With long reads, you do colinear chaining instead: you find subsets of seeds that are in the same order both in the read and the reference, with plausible distances in both.
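
A minimal sketch of both steps, using a plain hash index in place of the FM-index or minimizer index a real mapper would use, and an O(n^2) chaining DP without the gap costs real chainers add:

    from collections import defaultdict

    def find_seeds(read, ref, k=15):
        # Hash-based k-mer index of the reference; seeds are exact matches
        # reported as (position in read, position in reference).
        index = defaultdict(list)
        for j in range(len(ref) - k + 1):
            index[ref[j:j+k]].append(j)
        return [(i, j)
                for i in range(len(read) - k + 1)
                for j in index.get(read[i:i+k], ())]

    def chain_score(seeds):
        # Heaviest colinear chain: each step must advance in both the read
        # and the reference. O(n^2) DP over seeds sorted by read position.
        seeds = sorted(seeds)
        best = [1] * len(seeds)
        for t in range(len(seeds)):
            for s in range(t):
                if seeds[s][0] < seeds[t][0] and seeds[s][1] < seeds[t][1]:
                    best[t] = max(best[t], best[s] + 1)
        return max(best, default=0)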

Then you take the most promising groups of seeds, align the rest of the read to the reference for each of them, and report the best alignment. You also need to estimate the mapping quality: the likelihood that the reported alignment is the correct one. That involves comparing the reported alignment to the other alignments you found, as well as estimating the likelihood that you missed other relevant alignments due to the heuristics you used.
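
Mapping quality is conventionally reported Phred-scaled (the MAPQ field in SAM/BAM). A toy version that only looks at the gap between the two best candidate scores; the logistic model and the 'scale' parameter are made up for illustration:

    import math

    def mapq(best, second_best, scale=1.0):
        # Toy Phred-scaled mapping quality from the score gap between the
        # best and second-best candidate alignments. Real mappers use more
        # elaborate models and also account for seeds lost to heuristics.
        gap = min(scale * (best - second_best), 50.0)  # avoid exp overflow
        p_wrong = 1.0 / (1.0 + math.exp(gap))
        return min(int(-10.0 * math.log10(max(p_wrong, 1e-6))), 60)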

In variant calling, you pile up the alignments over the reference. If most reads have the same edit (variant) at the same location, it is likely present in the sequenced genome. (Or ~half the reads for heterozygous variants in a diploid genome.) But things get complicated due to larger (structural) variants, sequencing errors, incorrectly aligned reads, and whatever else. Variant calling was traditionally done with combinatorial or statistical algorithms, but these days it's best to understand it as an image classification task: tools like DeepVariant encode each pileup as an image and classify it with a neural network.
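
A bare-bones frequency-threshold caller over such a pileup might look like this; the thresholds are illustrative, not taken from any real tool:

    def call_snvs(pileup, ref, min_depth=10, hom_min=0.9, het_band=(0.25, 0.75)):
        # pileup: position -> list of bases observed in reads covering it.
        calls = {}
        for pos, bases in sorted(pileup.items()):
            if len(bases) < min_depth:
                continue
            alts = [b for b in bases if b != ref[pos]]
            if not alts:
                continue
            alt = max(set(alts), key=alts.count)  # most common non-ref base
            frac = alts.count(alt) / len(bases)
            if frac >= hom_min:
                calls[pos] = (ref[pos], alt, "hom")
            elif het_band[0] <= frac <= het_band[1]:
                calls[pos] = (ref[pos], alt, "het")  # ~half the reads
        return calls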

De novo assembly starts with brute force: you align all reads against each other and try to find long enough approximate overlaps between them. You build a graph where the reads are the nodes and each good enough overlap becomes an edge. Then you try to simplify the graph, for example by collapsing segments where all or most reads support the same alignment into a single node, and by removing rarely used edges. And then you try to find sufficiently unambiguous paths in the graph and interpret them as parts of the sequenced genome.
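
Here is the brute-force core, with exact suffix-prefix overlaps standing in for the approximate overlaps real assemblers compute, and no k-mer filter to cut down the quadratic all-vs-all comparison:

    def suffix_prefix_overlap(a, b, min_len=20):
        # Longest exact overlap where a suffix of read a matches a prefix
        # of read b, if it is at least min_len long.
        for n in range(min(len(a), len(b)), min_len - 1, -1):
            if a.endswith(b[:n]):
                return n
        return 0

    def overlap_graph(reads, min_len=20):
        # Node = read index, edge = (i, j, overlap length).
        edges = []
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i == j:
                    continue
                n = suffix_prefix_overlap(a, b, min_len)
                if n:
                    edges.append((i, j, n))
        return edges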

There are also some pre-/postprocessing steps that can improve the quality of de novo assembly. You can do some error correction before assembly. If the average coverage of the sequenced genome is 30x but you see a certain substring only once or twice, it is likely a sequencing error that can be corrected. Or you can polish the assembly afterwards. If you assembled the genome from long reads (with a higher error rate) for better contiguity, and you also have short reads (with a lower error rate), you can do something similar to reference-based assembly, with the preliminary assembly as the reference, to fix some of the errors.
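
The error-correction idea reduces to looking at the k-mer spectrum. A sketch that just flags suspect k-mers; actually correcting them would mean finding a close high-count k-mer, and min_count is an illustrative cutoff:

    from collections import Counter

    def suspect_kmers(reads, k=21, min_count=3):
        # At ~30x coverage a true k-mer should be seen many times, so
        # k-mers seen only once or twice are likely sequencing errors.
        counts = Counter()
        for r in reads:
            counts.update(r[i:i+k] for i in range(len(r) - k + 1))
        return {km for km, c in counts.items() if c < min_count}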


