DNA has been identified as a promising storage medium owing to its high storage density, capacity and longevity, properties that matter all the more as our ability to create and collect data has increased dramatically in recent years. A DNA storage system translates binary code (that is, the data) into strands of DNA (the encoding step), and later translates those strands back into the data (the decoding step). To lower error rates and shorten execution times during decoding, DNA clustering algorithms have been used extensively. These algorithms group similar DNA sequences based on the patterns present in those sequences: by avoiding the need to read multiple similar sequences individually, they can improve efficiency, and by correcting erroneous bases in candidate sequences based on their cluster, they can decrease decoding errors. Although numerous algorithms exist for this task, they still lack the efficiency and scalability required for storage systems. To this end, Guanjin Qu and colleagues developed Clover, a tree structure-based DNA clustering algorithm for DNA-based data storage that has lower complexity and greater scalability than previous methods.
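To make the encoding and decoding steps concrete, the following is a minimal sketch that assumes the simplest possible mapping of two bits per base (00 to A, 01 to C, 10 to G, 11 to T). This mapping and the function names `encode` and `decode` are illustrative assumptions rather than the scheme of any particular storage system; practical systems typically also add addressing and error-correcting redundancy.

```python
# Hypothetical 2-bits-per-base mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
BASES = "ACGT"


def encode(data: bytes) -> str:
    """Translate bytes into a DNA strand, four bases per byte."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)  # most significant bit pair first
    )


def decode(strand: str) -> bytes:
    """Translate a DNA strand produced by encode() back into bytes."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            # str.index raises ValueError for any character outside ACGT.
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

In this toy setting a single substituted base corrupts one byte, which is the kind of error that clustering many noisy reads of the same strand is meant to help correct.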
To perform the clustering task, Clover begins by creating a database with a core set of subsequences observed in the DNA sequences to be decoded, and every unclassified subsequence is compared with this core set. If the unclassified subsequence has a match in the core set, it is assigned to the corresponding cluster; if no match is identified, the core set is expanded with that subsequence. To reduce computation time, an index tree is built over the core subsequences, thereby minimizing the time taken for comparisons. In addition, Clover reduces memory consumption by releasing the memory used for a DNA sequence once it has been compared. Finally, to reduce the impact of sequence errors, the authors introduced the concept of node drifting. Traditionally, if an unclassified subsequence matches a core subsequence at every node (each node being one of the bases A, T, G or C) except one, the mismatch is assumed to be an error; with node drifting, if a node does not match the core subsequence, the rest of the unclassified subsequence is compared against the other nodes, and if the next node can be matched, the subsequence is drifted to that particular cluster. Ultimately, this reduces the error rate of the algorithm. The authors showed that Clover ran over ten times faster than every existing algorithm it was compared with. Clover was also used to cluster a sequence containing 10 billion nucleobases, a task that cannot be performed by previously published tools because of their large memory overhead and long execution times. Notably, Clover completed this task with a 99.99% accuracy rate, showcasing the effectiveness of the tool. The results indicate that Clover is a promising DNA clustering algorithm that can take us one step closer to using DNA as a storage system.
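The workflow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `TrieNode`, `CoreIndex` and `cluster_reads` are hypothetical, only a fixed-length prefix of each read is indexed (`prefix_len` is an assumed parameter), and node drifting is approximated as tolerating a single substituted base by trying sibling branches of the tree.

```python
from typing import Dict, Iterable, Iterator, Optional


class TrieNode:
    """One node of the index tree; each edge is labelled with a base."""
    __slots__ = ("children", "cluster_id")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.cluster_id: Optional[int] = None


class CoreIndex:
    """Index tree over the core set of subsequences (hypothetical API)."""

    def __init__(self, prefix_len: int = 8, max_drifts: int = 1) -> None:
        self.root = TrieNode()
        self.prefix_len = prefix_len    # assumed: only a prefix is indexed
        self.max_drifts = max_drifts    # assumed: mismatches tolerated
        self.num_clusters = 0

    def _search(self, node: TrieNode, key: str, pos: int,
                drifts: int) -> Optional[int]:
        if pos == len(key):
            return node.cluster_id
        # Follow the exact branch first.
        child = node.children.get(key[pos])
        if child is not None:
            found = self._search(child, key, pos + 1, drifts)
            if found is not None:
                return found
        # Simplified "node drifting": on a mismatch, try the other
        # branches and keep matching the rest of the subsequence.
        if drifts > 0:
            for base, other in node.children.items():
                if base != key[pos]:
                    found = self._search(other, key, pos + 1, drifts - 1)
                    if found is not None:
                        return found
        return None

    def _insert(self, key: str) -> int:
        node = self.root
        for base in key:
            node = node.children.setdefault(base, TrieNode())
        node.cluster_id = self.num_clusters
        self.num_clusters += 1
        return node.cluster_id

    def assign(self, read: str) -> int:
        """Return the cluster of a read, expanding the core set if needed."""
        key = read[: self.prefix_len]
        found = self._search(self.root, key, 0, self.max_drifts)
        return found if found is not None else self._insert(key)


def cluster_reads(reads: Iterable[str], index: CoreIndex) -> Iterator[int]:
    """Stream reads through the index; no read is kept after comparison."""
    for read in reads:
        yield index.assign(read)


reads = ["ACGTACGTAA", "ACGTACGTCC", "ACGAACGTGG", "TTTTGGGGCC"]
labels = list(cluster_reads(reads, CoreIndex(prefix_len=8)))
print(labels)  # [0, 0, 0, 1]
```

In this toy run the third read differs from the first core subsequence at one base and is still drifted into cluster 0, whereas the fourth read matches nothing and starts a new cluster. Streaming the reads through a generator mirrors the idea of releasing a sequence's memory once it has been compared, and each lookup touches at most a bounded number of tree nodes rather than every core subsequence.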