teekert
4 months ago
We are not “in the nanopore era of sequencing”. We are (still) firmly in the sequencing by synthesis era.
Yes, it requires chopping the genome into small(er) pieces (than with Nanopore sequencing) and then reconstructing the genome based on a reference (and this has its issues). But Nanopore sequencing is still far from perfect due to its high error rate. Any clinical sequencing is still done using sequencing by synthesis (at which Illumina has gotten very good over the past decade).
Nanopore devices are truly cool, small and comparatively cheap though, and you can compensate for the error rate by just sequencing everything multiple times. I’m not too familiar with the economics of this approach though.
With SBS technology you could probably sequence your whole genome 30 times (a normal “coverage”) for below 1000€/$ with a reputable company. I’ve seen $180, but not sure if I’d trust that.
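A toy illustration of the “compensate by sequencing multiple times” point above, in Python. The reads are invented, and it assumes they are already aligned, equal length, and only contain substitution errors, which is a big simplification; real consensus callers also weigh base qualities and handle indels.

    from collections import Counter

    def consensus(reads):
        """Majority vote at each position across several noisy reads
        of the same molecule (assumes aligned, equal-length reads with
        substitution errors only)."""
        return "".join(
            Counter(bases).most_common(1)[0][0]
            for bases in zip(*reads)
        )

    # Three noisy copies of the same (made-up) fragment:
    reads = ["ACGTTAGC", "ACGTTGGC", "ACCTTAGC"]
    print(consensus(reads))  # ACGTTAGC -- isolated random errors get voted out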
Metacelsus
4 months ago
> you can compensate for the error rate by just sequencing everything multiple times.
Usually, but sometimes the errors are correlated.
Overall I agree, short read sequencing is a lot more cost effective. Doing an Illumina whole genome sequence for cell line quality control (at my startup) costs $260 in total.
bonsai_spool
4 months ago
> But Nanopore sequencing is still far from perfect due to its high error rate. Any clinical sequencing is still done using sequencing by synthesis (at which Illumina has gotten very good over the past decade).
There is no reason for Nanopore to supplant sequencing-by-synthesis for short reads - that's largely solved and getting cheaper all the while.
The future clinical utility will be in medium- and large-scale variation. We don't understand this in the clinical setting nearly as well as we understand SNPs. So Nanopore is being used in the research setting and to diagnose individuals with very rare genetic disorders.
(edit)
> We are not “in the nanopore era of sequencing”. We are (still) firmly in the sequencing by synthesis era.
I also strongly disagree.
SBS is very reliable, but it's also commonplace (if Toyota is the most popular car, does that mean we're in the Toyota internal combustion era? Or can Waymo still matter despite its small footprint?).
Novelty in sequencing is coming from ML approaches, RNA-DNA analysis, and combining long- and short-read technologies.
teekert
4 months ago
I agree with you. Long reads lead to new insights and, over time, to better diagnoses by providing a better understanding of large(r) scale aberrations, and as the tech gets better it will be able to do so more easily. But it’s really not there yet. It’s mostly research, and somehow it’s not really improving as much as hoped, I get the feeling.
Onavo
4 months ago
You can get it pretty damn cheap if you are willing to send your biological data overseas. Nebula genomics and a lot of other biotechs do this by essentially outsourcing to China. There's no particular technology secret, just cheaper labor and materials.
vintermann
4 months ago
Can you trust it though? It'd be trivially easy to do a 1x read, maybe 2x, and then fake the other 28 reads. And it'd be hard to catch someone doing this without doing another 30x read from someone you trust. There's famously a lot of cheating in medical research, it would be odd if everyone stopped the moment they left academia (there have been scandals with forensic labs cheating too, now that I think about it).
gillesjacobs
4 months ago
They save money through cheap labour and by batching large quantities for analysis. For the consumer this means long wait times and potentially expired DNA samples.
I tried two samples with Nebula and waited 11 months total. Both samples failed. Got a refund on the service but spent $50 in postage for the sample kit.
jefftk
4 months ago
> We are (still) firmly in the sequencing by synthesis era.
It really depends what your goals are. At the NAO we use Illumina with their biggest flow cell (25B) for wastewater because the things we're looking for (ex: respiratory viruses) are a small fraction of the total nucleic acids and we need the lowest cost per base pair. But when we sequence nasal swabs these viruses are a much higher fraction, and the longer reads and lower cost per run of Nanopore make it a better fit.
the__alchemist
4 months ago
I guess this depends on the application. For whole human genome? Not nanopore era. For plasmids? Absolutely.
I'm a nobody, and I can drop a tube into a box in a local university, and get the results emailed to me by next morning for $15USD. This is due to a streamlined nanopore-based workflow.
celltalk
4 months ago
This is wrong; a lot of diagnostic labs are actually going for nanopore sequencing since its prep is overall cheaper compared to the alternatives. Also, the sensitivity for the relevant regions usually matches qPCR, and it can give you more information, such as methylation, on top of that.
A recent paper on classifying acute leukemia via nanopore: https://www.nature.com/articles/s41588-025-02321-z/figures/8
The timelines are exaggerated, but still, it works and that’s what matters in diagnostics.
BobbyTables2
4 months ago
I’ve always wondered how the reconstruction works.
It would be difficult to break a modest program into basic blocks and then reconstruct it. Same with paragraphs in a book.
How does this work with DNA?
__MatrixMan__
4 months ago
You align it to a reference genome.
It's like you have an intact 6th edition of a textbook, and you have several copies of the 7th edition sorted randomly with no page numbers. Programs like BLAST will build an index based on the contents of 6, and then each page of 7 can be compared against the index and you'll learn that for a given page of 7 it aligns best at character 123456 of 6 or whatever.
Do that for each page in your pile and you get a chart where on the X axis is the character index of 6 and on the Y axis is the number of pages of 7 which were aligned there. The peaks and valleys in that graph can tell you about the inductive strength of your assumption that a given read is aligned correctly to the reference genome (plus you score it based on mismatches, insertions and gaps).
So if many of the same pages were chosen for a given locus, yet the sequence differs, then you have reason to trust that there's an authentic difference between your sample and the reference in that location.
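A toy version of that index-and-count idea in Python (reference and reads invented; real aligners like BLAST or BWA handle mismatches, gaps and scoring, which this skips):

    from collections import defaultdict, Counter

    def build_index(reference, k=4):
        """Map every k-mer in the reference to the positions where it occurs."""
        index = defaultdict(list)
        for i in range(len(reference) - k + 1):
            index[reference[i:i + k]].append(i)
        return index

    def place_read(read, index, k=4):
        """Vote for the read's start position using its exact k-mer seeds."""
        votes = Counter()
        for offset in range(len(read) - k + 1):
            for pos in index.get(read[offset:offset + k], []):
                votes[pos - offset] += 1
        return votes.most_common(1)[0][0] if votes else None

    reference = "ACGTACGGTTAACCGGATCGTTACG"
    reads = ["CGGTTAACC", "GATCGTTAC", "ACGTACGGT"]
    index = build_index(reference)
    coverage = [0] * len(reference)
    for read in reads:
        start = place_read(read, index)
        if start is not None:
            for i in range(start, min(start + len(read), len(reference))):
                coverage[i] += 1
    print(coverage)  # the per-position "pages aligned here" counts from the analogy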
There are a lot of chemical tricks you can do to induce meaningful non-uniformity in this graph. See ChIP-Seq for instance, where peaks indicate histone methylation marks, which typically correspond to a gene that was enabled for transcription when the sample was taken.
If you don't have a reference genome, then you can run the sample on a gel to separate the sequences by length; that'll group them by chromosome. From there you've got a much more computationally challenging problem, but as long as you can ensure that the DNA is cut at random locations before reads are taken, you can use overlaps to figure out the sequence, because unlike the textbook page example, the page boundaries are not gonna line up (but the chromosome ends are):
Mary had a little
was white as snow
lamb whose fleece was
Mary had
had a little lamb
a little lamb
was white
white as snow
So you can find the start and ends based on where no overlaps occur (nothing ever comes before Mary or after snow), and then you can build the rest of the sequence based on overlaps.

If you're working with circular chromosomes (bacteria and some viruses) you can't reason based on ends, but as long as you have enough data there's still gonna be just one way to make a loop out of your reads. (Imagine the above example, but with the song that never ends. You could still manage to build a loop out of it despite not having an end to work from.)
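A rough Python sketch of that overlap idea using the fragments above, greedily merging the best-overlapping pair each round (real assemblers deal with read errors, repeats and far smarter data structures):

    def overlap(a, b):
        """Length of the longest suffix of word list a that is a prefix of b."""
        for k in range(min(len(a), len(b)), 0, -1):
            if a[-k:] == b[:k]:
                return k
        return 0

    def greedy_assemble(fragments):
        """Repeatedly merge the pair of fragments with the largest overlap."""
        frags = [f.split() for f in fragments]
        while len(frags) > 1:
            best = None
            for i, a in enumerate(frags):
                for j, b in enumerate(frags):
                    if i != j:
                        k = overlap(a, b)
                        if best is None or k > best[0]:
                            best = (k, i, j)
            k, i, j = best
            merged = frags[i] + frags[j][k:]
            frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
        return " ".join(frags[0])

    fragments = [
        "Mary had a little",
        "was white as snow",
        "lamb whose fleece was",
        "Mary had",
        "had a little lamb",
        "a little lamb",
        "was white",
        "white as snow",
    ]
    print(greedy_assemble(fragments))
    # Mary had a little lamb whose fleece was white as snow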
vintermann
4 months ago
They exploit the fact that so much of our DNA is the same. They basically have the book with no typos, or rather with only the typos they've decided to call canonical.
So given a short sentence excerpt, even with a few errors thrown in, partial string matching is usually able to figure out where in the book it was likely from. Sometimes there may be more possibilities, but then you can look at overlaps and count how many times a particular variant appears in one context vs. another.
One problem is, DNA contains a lot of copies and repetitive stretches, as if the book had "all work and no play makes jack a dull boy" repeated end to end for a couple of pages. Then it can be hard to place where the variant actually is. Longer reads help with this.
jakobnissen
4 months ago
There are two ways: assembly by mapping and de novo assembly.
If you already have a human genome file, you can take each DNA piece and map it to its closest match in the genome. If you can cover the whole genome this way, you are done.
The alternative way is to exploit overlaps between DNA fragments. If two 1000 bp pieces overlap by 900 basepairs, that's probably because they come from two 1000 bp regions of your genome that overlap by 900 basepairs. You can then merge the pieces. By iteratively merging millions of fragments you can reconstruct the original genome.
Both these approaches are surprisingly and delightfully deep computational problems that have been researched for decades.
bonsai_spool
4 months ago
This is very easily googled. There are new algorithmic advances for new kinds of sequencing data, but the key ideas date from the 70s.
nextaccountic
4 months ago
If you broke a string into overlapping blocks, you could easily reconstruct it. The key here is that the blocks form a sliding window over the string.
If the blocks were nonoverlapping, then yeah, the problem is much harder, akin to fitting pieces of a puzzle. I bet a language model could still do it though.
jltsiren
4 months ago
The basic assumption is that most of the genome is essentially random. If you take a short substring from an arbitrary location, it will likely define the location uniquely. Then there are some regions with varying degrees of repetitiveness that require increasingly arcane heuristics to deal with.
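A quick way to see that assumption in action with a purely synthetic random sequence (a real genome is less random, which is exactly where the arcane heuristics come in):

    import random

    random.seed(0)
    genome = "".join(random.choice("ACGT") for _ in range(1_000_000))

    k = 20
    seen, repeats = set(), 0
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if kmer in seen:
            repeats += 1
        seen.add(kmer)
    # For a megabase of random sequence, a 20-letter substring is almost
    # always unique, so it pins down its own location.
    print(f"{repeats} repeated {k}-mers out of {len(genome) - k + 1}")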
There are two basic approaches: reference-based and de novo assembly. In reference-based assembly, you already have a reference genome that should be similar to the sequenced genome. You map the reads to the reference and then call variants to determine how the sequenced genome is different from the reference. In de novo assembly, you don't have a reference or you choose to ignore it, so you assemble the genome from the reads without any reference to guide (and bias) you.
Read mapping starts with using a text index to find seeds: fixed-length or variable-length exact matches between the read and the reference. Then, depending on seed length and read length, you may use the seeds directly or try to combine them into groups that likely correspond to the same alignment. With short reads, it may be enough to cluster the seeds based on distances in the reference. With long reads, you do colinear chaining instead. You find subsets of seeds that are in the same order both in the read and the reference, with plausible distances in both.
Then you take the most promising groups of seeds and align the rest of the read to the reference for each of them. And report the best alignment. You also need to estimate the mapping quality: the likelihood that the reported alignment is the correct one. That involves comparing the reported alignment to the other alignments you found, as well as estimating the likelihood that you missed other relevant alignments due to the heuristics you used.
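A bare-bones version of the colinear chaining step, with invented (read position, reference position) seed coordinates; real chainers also score anchor lengths and penalize the gaps between consecutive seeds:

    def colinear_chain(seeds):
        """Longest chain of seeds whose positions increase in both the read
        and the reference (simple O(n^2) dynamic programming)."""
        seeds = sorted(seeds)                  # sort by (read_pos, ref_pos)
        best = [1] * len(seeds)                # best chain length ending at seed i
        prev = [None] * len(seeds)
        for i, (ri, gi) in enumerate(seeds):
            for j, (rj, gj) in enumerate(seeds[:i]):
                if rj < ri and gj < gi and best[j] + 1 > best[i]:
                    best[i] = best[j] + 1
                    prev[i] = j
        i = max(range(len(seeds)), key=lambda n: best[n])
        chain = []
        while i is not None:
            chain.append(seeds[i])
            i = prev[i]
        return chain[::-1]

    # The (40, 900) seed is a spurious hit to a repeat elsewhere in the
    # reference; chaining drops it because it breaks the shared order.
    seeds = [(10, 5010), (25, 5025), (40, 900), (60, 5061), (80, 5080)]
    print(colinear_chain(seeds))
    # [(10, 5010), (25, 5025), (60, 5061), (80, 5080)]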
In variant calling, you pile the alignments over the reference. If most reads have the same edit (variant) at the same location, it is likely present in the sequenced genome. (Or ~half the reads for heterozygous variants in a diploid genome.) But things get complicated due to larger (structural) variants, sequencing errors, incorrectly aligned reads, and whatever else. Variant calling was traditionally done with combinatorial or statistical algorithms, but these days it's best to understand it as an image classification task.
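A toy version of the traditional counting approach (alignments invented; substitutions only, no indels, base qualities, or a proper diploid genotyping model, and nothing like the image-classification style mentioned above):

    from collections import Counter

    def call_variants(reference, alignments, min_fraction=0.3):
        """Pile aligned reads onto the reference and report positions where
        a non-reference base is supported by enough of the covering reads."""
        piles = [Counter() for _ in reference]
        for start, read in alignments:
            for offset, base in enumerate(read):
                piles[start + offset][base] += 1
        variants = []
        for pos, pile in enumerate(piles):
            depth = sum(pile.values())
            for base, count in pile.items():
                if base != reference[pos] and count / depth >= min_fraction:
                    variants.append((pos, reference[pos], base, count, depth))
        return variants

    reference = "ACGTACGTAC"
    alignments = [(0, "ACGTAC"), (2, "GTTCGTAC"), (4, "TCGTAC")]
    print(call_variants(reference, alignments))
    # [(4, 'A', 'T', 2, 3)] -- two of the three covering reads support an A->T change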
De novo assembly starts with brute force: you align all reads against each other and try to find long enough approximate overlaps between them. You build a graph, where the reads are the nodes and each good enough overlap becomes an edge. Then you try to simplify the graph, for example by collapsing segments, where all/most reads support the same alignment, into a single node, and removing rarely used edges. And then you try to find sufficiently unambiguous paths in the graph and interpret them as parts of the sequenced genome.
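A sketch of the all-against-all overlap graph construction (exact suffix-prefix overlaps and a quadratic loop over three invented reads; real assemblers use approximate overlaps and much cleverer indexing):

    def suffix_prefix_overlap(a, b, min_len=3):
        """Longest suffix of read a that equals a prefix of read b,
        if it is at least min_len long."""
        for k in range(min(len(a), len(b)), min_len - 1, -1):
            if a[-k:] == b[:k]:
                return k
        return 0

    def overlap_graph(reads, min_len=3):
        """Nodes are read indices; an edge i -> j means read i's suffix
        overlaps read j's prefix by the given number of bases."""
        edges = {}
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = suffix_prefix_overlap(a, b, min_len)
                    if k:
                        edges.setdefault(i, []).append((j, k))
        return edges

    reads = ["ACGTTAGC", "TTAGCCAT", "GCCATGGA"]
    print(overlap_graph(reads))
    # {0: [(1, 5)], 1: [(2, 5)]} -- one unambiguous path 0 -> 1 -> 2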
There are also some pre-/postprocessing steps that can improve the quality of de novo assembly. You can do some error correction before assembly. If the average coverage of the sequenced genome is 30x but you see a certain substring only once or twice, it is likely a sequencing error that can be corrected. Or you can polish the assembly afterwards. If you assembled the genome from long reads (with a higher error rate) for better contiguity, and you also have short reads (with a lower error rate), you can do something similar to reference-based assembly, with the preliminary assembly as the reference, to fix some of the errors.
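And a minimal sketch of the k-mer counting idea behind that kind of error correction (toy reads; a real corrector also decides what to replace a rare k-mer with rather than just flagging it):

    from collections import Counter

    def suspicious_kmers(reads, k=5, min_count=3):
        """k-mers seen fewer than min_count times across high-coverage reads
        are likely sequencing errors rather than real genome content."""
        counts = Counter()
        for read in reads:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1
        return {kmer for kmer, c in counts.items() if c < min_count}

    # Ten noisy copies of the same fragment; one read has a single G->C error.
    true_fragment = "ACGTACGGTTAACC"
    reads = [true_fragment] * 9 + ["ACGTACGCTTAACC"]
    print(sorted(suspicious_kmers(reads)))
    # Flags only the five 5-mers that span the erroneous base.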
Danjoe4
4 months ago
Nanopore is good for hybrid sequencing. You can align the higher-quality Illumina reads against its longer contiguous reads.