Segmenting DNA sequence into words based on statistical language model

Liang, Wang

doi:10.1038/npre.2012.6939.1

Manuscript
Open access
Published: 27 February 2012

Wang Liang¹

Nature Precedings (2012)

Abstract

This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that most DNA “words” are 12 to 15 bp long. The bound on the language entropy of DNA sequences is about 1.5674 bits. After building an n-gram biological language model, we design an unsupervised ‘probability approach to word segmentation’ method to segment DNA sequences. A benchmark for the segmentation method is also proposed. In cross-segmentation tests, we find that different genomes may use similar languages that nevertheless belong to different branches, much like English and French/Latin. Finally, we present some possible applications of this method.
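
To make the idea of probability-based word segmentation concrete, the following is a minimal sketch and not the paper's implementation. It assumes a unigram simplification of the n-gram model: each candidate word has an independent probability, and dynamic programming picks the split with the highest total log-probability. The function names (`kmer_word_probs`, `segment`), the `floor` probability for unseen words, and the toy probability table are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def kmer_word_probs(sequence, max_len):
    """Toy word model: relative frequency of every substring up to max_len.

    This is an illustrative assumption, not the paper's training procedure.
    """
    counts = Counter(
        sequence[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(sequence) - k + 1)
    )
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def segment(sequence, probs, max_len, floor=1e-12):
    """Split sequence into words maximizing the sum of log word probabilities.

    Words missing from probs get the small probability `floor` (an assumption)
    so that every position remains reachable.
    """
    n = len(sequence)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for sequence[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sequence[start:end]
            score = best[start] + math.log(probs.get(word, floor))
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the words in order.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sequence[start:end])
        end = start
    return words[::-1]


# Hypothetical hand-made probabilities, used only to show the call pattern.
toy_probs = {"ATG": 0.5, "CCC": 0.4, "A": 0.01, "C": 0.01, "G": 0.01, "T": 0.01}
print(segment("ATGCCCATG", toy_probs, max_len=3))
```

With a real genome, `max_len` would be set near the 12 to 15 bp word length reported above, and the word probabilities would come from a trained language model rather than a hand-written table.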

Author information

Authors and Affiliations

Tencent
Wang Liang

Authors

Wang Liang

Corresponding author

Correspondence to Wang Liang.

Rights and permissions

Creative Commons Attribution 3.0 License.

About this article

Cite this article

Liang, W. Segmenting DNA sequence into words based on statistical language model. Nat Prec (2012). https://doi.org/10.1038/npre.2012.6939.1

Received: 26 February 2012
Accepted: 27 February 2012
Published: 27 February 2012
DOI: https://doi.org/10.1038/npre.2012.6939.1
