DNA clustering made more efficient

Rastogi, Ananya

doi:10.1038/s43588-022-00330-0

Research Highlight
Published: 19 September 2022

DNA computing

Nature Computational Science volume 2, page 558 (2022)

DNA is a promising storage medium owing to its high storage density, capacity and longevity, properties that matter increasingly as our ability to create and collect data has grown dramatically. A DNA storage system translates binary code (that is, the data) into strands of DNA (the encoding step), and later translates strands of DNA back into the data (the decoding step). To lower error rates and shorten execution times during decoding, DNA clustering algorithms are widely used. These algorithms group similar DNA sequences based on the patterns present in them: by avoiding reading multiple similar sequences individually, they improve efficiency, and by correcting erroneous bases in candidate sequences based on their cluster, they reduce decoding errors. Although numerous algorithms exist for this task, they still lack the efficiency and scalability that storage systems require. To this end, Guanjin Qu and colleagues developed Clover, a tree structure-based DNA clustering algorithm for DNA-based data storage that has lower complexity and greater scalability than previous methods.
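To make the encoding and decoding steps concrete, the following minimal sketch maps every two bits to one nucleobase and back. The mapping table and the function names `encode` and `decode` are illustrative assumptions; the highlight does not specify which coding scheme a given storage system uses, and practical schemes typically add error-correcting redundancy.

```python
# Hypothetical 2-bits-per-base mapping, chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Encoding step: turn binary data into a DNA strand (one base per 2 bits)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Decoding step: turn an error-free DNA strand back into binary data."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

In this sketch a single substituted base corrupts the decoded byte, which is why grouping noisy copies of the same strand before decoding helps.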

To perform the clustering task, Clover begins by creating a database with a core set of subsequences observed in the DNA sequence to be decoded, and every unclassified subsequence is compared with this core set. If the unclassified subsequence has a match in the core set, it is assigned to the corresponding cluster; if no match is identified, the core set is expanded with that subsequence. To reduce computation time, an index tree is built over the core subsequences, which minimizes the time spent on comparisons. In addition, Clover reduces memory consumption by releasing the memory used for a DNA sequence once it has been compared. Finally, to reduce the impact of sequence errors, the authors introduced the concept of node drifting. Traditionally, if an unclassified subsequence matches a core subsequence in all nodes (A, T, G and C) except one, the mismatch is assumed to be an error; with node drifting, if a node does not match the core subsequence, the rest of the unclassified sequence is compared to other nodes, and if the next node can be matched, the subsequence is drifted to that particular cluster. Ultimately, this reduces the error rate of the algorithm. The authors showed that Clover ran more than ten times faster than the existing algorithms it was compared against. Clover was also used to cluster a sequence containing 10 billion nucleobases, a task that previously published tools cannot perform owing to their large memory overhead and execution times. Clover completed this task with 99.99% accuracy, showcasing the efficiency of the tool. The results indicate that Clover is a promising DNA clustering algorithm that can take us one step closer to using DNA as a storage system.
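The core-set and index-tree idea described above can be sketched as a greedy clustering loop over a trie. This is not Clover's implementation: the names `TrieNode`, `lookup` and `cluster_reads`, the use of a fixed-length prefix as the subsequence, and the parameters `key_len` and `max_mismatches` are all assumptions made for illustration. The mismatch-tolerant walk handles substitutions only and is a simplified stand-in for node drifting, not the authors' procedure.

```python
from typing import Dict, Iterable, List, Optional


class TrieNode:
    """One node of the index tree; children are keyed by base (A, C, G, T)."""
    __slots__ = ("children", "cluster_id")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.cluster_id: Optional[int] = None


def insert(root: TrieNode, key: str, cluster_id: int) -> None:
    """Add a new core subsequence to the index tree."""
    node = root
    for base in key:
        node = node.children.setdefault(base, TrieNode())
    node.cluster_id = cluster_id


def lookup(node: TrieNode, key: str, pos: int, mismatches_left: int) -> Optional[int]:
    """Return the cluster of a matching core subsequence, or None.

    The exact branch is tried first; if it fails and mismatches remain,
    sibling branches are tried and the walk continues from the next base.
    """
    if pos == len(key):
        return node.cluster_id
    child = node.children.get(key[pos])
    if child is not None:
        found = lookup(child, key, pos + 1, mismatches_left)
        if found is not None:
            return found
    if mismatches_left > 0:
        for base, other in node.children.items():
            if base != key[pos]:
                found = lookup(other, key, pos + 1, mismatches_left - 1)
                if found is not None:
                    return found
    return None


def cluster_reads(reads: Iterable[str], key_len: int = 8,
                  max_mismatches: int = 1) -> List[int]:
    """Assign each read a cluster label; unmatched reads become new cores."""
    root = TrieNode()
    labels: List[int] = []
    n_clusters = 0
    for read in reads:
        key = read[:key_len]
        cid = lookup(root, key, 0, max_mismatches)
        if cid is None:              # no match: expand the core set
            cid = n_clusters
            n_clusters += 1
            insert(root, key, cid)
        labels.append(cid)           # only the label is kept, not the read
    return labels


reads = ["ACGTACGTAA", "ACGTACGTCC", "ACGAACGTGG", "TTTTGGGGCC"]
print(cluster_reads(reads))  # [0, 0, 0, 1]
```

Because `cluster_reads` accepts any iterable and stores only labels, reads can be streamed and discarded after comparison, loosely mirroring the memory-release behaviour described above. Each lookup walks at most `key_len` levels per explored branch, so its cost depends on the key length and the mismatch budget rather than on the number of clusters.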

Author information

Authors and Affiliations

Nature Computational Science https://www.nature.com/natcomputsci/
Ananya Rastogi

Corresponding author

Correspondence to Ananya Rastogi.

About this article

Cite this article

Rastogi, A. DNA clustering made more efficient. Nat Comput Sci 2, 558 (2022). https://doi.org/10.1038/s43588-022-00330-0

Published: 19 September 2022
Issue Date: September 2022
DOI: https://doi.org/10.1038/s43588-022-00330-0
