This page has been archived and is no longer updated

 
June 18, 2012 | By:  Eric Sawyer

Your Friend the Sequence Logo

Got a conserved nucleic acid or protein motif that you're just dying to understand? Fear no more, sequence logos are here! Sequence logos are a great way to visualize large numbers of DNA or protein sequences that share a conserved feature. For instance, accurate gene expression relies on proteins called transcription factors, which recognize particular DNA sequences with high specificity. We call these sequences binding sites. Transcription factors aren't reading DNA sequences the way humans read them on a computer screen. Rather, they move randomly through the cell and interact with the molecules in their environment. When they happen across DNA, whether or not they bind depends on the shape of the DNA, which in turn depends on the sequence. Binding might occur only if the fit is exactly right, or sometimes if the fit is merely about right. We can use sequence logos to understand where a given transcription factor's behavior falls along this spectrum.

How do you go from a big scary data file filled with sequence information to a beautiful sequence logo like the one at the top of the page? I'm going to walk through an example of how you can use sequence logos to understand these data files. Here's a sample data file, with DNA sequences I randomly generated based on the weights from the sequence logo at the top of the page. If the sequence logo is an accurate representation of the behavior of the transcription factor (it happens to be called CEBPA, but that isn't important), then the sequences below wouldn't surprise us in a real dataset.

> Sample Sequence 1

TCTCGCAAC

> Sample Sequence 2

TAGCTCAAT

> Sample Sequence 3

TTTCCCAAC

> Sample Sequence 4

AATACCAAG

> Sample Sequence 5

ATGCGCAAT

> Sample Sequence 6

ATTGGAAAT

> Sample Sequence 7

ATTCTCAAT

> Sample Sequence 8

ACACGCAAT

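Before we can make a sequence logo, we need a position weight matrix (PWM) that tells us how often each base (A, C, G, or T) occurs in each position of the motif. (Strictly speaking, a table of raw fractions like this is often called a position frequency or position probability matrix, with "PWM" reserved for log-scaled scores, but I'll use the term loosely here.) Here's an algorithm for obtaining that information: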

1. Remove header lines and gather sequences into a list.

2. Loop through the list, keeping track of the number of A's, C's, G's, and T's by position.

3. For each position, calculate the fraction of sequences containing A, C, G, and T.

4. Combine these fractions into a table, called a PWM.
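The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind any particular logo tool: the function names `read_sequences` and `build_pwm` are my own, and the sketch assumes every sequence has the same length and contains only A, C, G, and T. The input is the sample data file shown earlier.

```python
from collections import Counter

# The sample data file from above, in FASTA-like format.
FASTA = """\
> Sample Sequence 1
TCTCGCAAC
> Sample Sequence 2
TAGCTCAAT
> Sample Sequence 3
TTTCCCAAC
> Sample Sequence 4
AATACCAAG
> Sample Sequence 5
ATGCGCAAT
> Sample Sequence 6
ATTGGAAAT
> Sample Sequence 7
ATTCTCAAT
> Sample Sequence 8
ACACGCAAT
"""

BASES = "ACGT"


def read_sequences(fasta_text):
    """Step 1: drop header lines (starting with '>') and blank lines."""
    lines = (line.strip() for line in fasta_text.splitlines())
    return [line.upper() for line in lines if line and not line.startswith(">")]


def build_pwm(sequences):
    """Steps 2-4: count bases by position, then turn counts into fractions."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    pwm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)  # step 2
        # Step 3: fraction of sequences with each base at this position.
        pwm.append({base: counts[base] / len(sequences) for base in BASES})
    return pwm  # step 4: one row of fractions per motif position


sequences = read_sequences(FASTA)
pwm = build_pwm(sequences)

# Print the table with one row per position and one column per base.
print("pos  " + "  ".join(f"{base:>5}" for base in BASES))
for number, column in enumerate(pwm, start=1):
    print(f"{number:>3}  " + "  ".join(f"{column[base]:5.3f}" for base in BASES))
```

Each row of the printed table sums to 1, and the rows for positions 7 and 8 put all of their weight on A, as expected from the sample sequences.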

Once we have the PWM, generating a sequence logo is either a trip to the web or a more strenuous exercise in computer programming. But first let me point out something you might not have appreciated about sequence logos at first sight, and which adds to their utility. Though not shown in my picture, the unit on the y-axis of a sequence logo is the bit, a unit of information. Positions 7 and 8 in the CEBPA motif are always occupied by A, and the letters stretch up to a full 2 bits of information. However, position 1 can be either an A or a T (more often a T), yet the combined height is just under 1.5 bits. This is because information is measured in a very particular way; we are not looking at the frequencies for each base directly (in which case the stacks of letters would always reach the same height) but at a calculation derived from them.

Conveying 1 bit of information can be accomplished with a single yes-no question, 2 bits with two questions, and so on [1]. If I am trying to find out which base is in position 7 (remember, always A), I can ask you "Is it G or C?" You respond "No," and I add "Well then, is it A?" and you say "Yes!" This exchange could have gone a number of ways, but if I do not already know the answer, no line of questioning is guaranteed to reach it in fewer than 2 questions.
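To make the yes-no game concrete, here is a small Python sketch that plays it by halving the remaining options with each question. The function name `count_questions` is made up for this illustration, and the sketch assumes all candidates are equally likely, which is exactly the situation of someone who knows nothing about the position.

```python
import math


def count_questions(candidates, answer):
    """Count the yes-no questions a halve-the-options strategy asks to find `answer`."""
    pool = list(candidates)
    if answer not in pool:
        raise ValueError("answer must be one of the candidates")
    asked = 0
    while len(pool) > 1:
        middle = len(pool) // 2
        first_half = pool[:middle]
        asked += 1  # one question: "Is it one of the bases in first_half?"
        pool = first_half if answer in first_half else pool[middle:]
    return asked


# Same order as the conversation above: "Is it G or C?", then "Is it A?"
for base in "GCAT":
    print(base, count_questions("GCAT", base))

# With four equally likely bases, log2(4) = 2 questions are needed.
print(math.log2(4))
```

Whichever base is the answer, the count comes out to 2, which is where the 2-bit ceiling for a DNA sequence logo comes from.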

That's where the analogy stops being useful, because as you can see most of the information values are fractional, and you cannot ask a fraction of a question. Nevertheless, I hope the analogy (put forward by Richard Dawkins in a different context)* was helpful. The real conversion is a mathematical formula. The less conserved, or predictable, a position in the motif is, the less information it carries. Stacks of low, hardly legible letters are not important in the binding of your protein.

* It turns out the analogy was not Dawkins's own (and he did not claim that it was). I have just seen it employed in Carl Sagan's Cosmos, episode 11 ("The Persistence of Memory"), which predates Dawkins's essay. I suspect the "analogy" (more of a definition, really) is older than either usage.
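For readers who want to see the "mathematical formula" in action, here is a hedged Python sketch of the commonly used definition: the information at a position is 2 bits minus the Shannon entropy of the base fractions there, and each letter's height is its fraction times that information. I am assuming this matches what logo software does up to details; in particular, the small-sample correction that some tools apply is left out. The function names are my own, and the three example columns are positions 7, 1, and 5 as counted from the eight sample sequences above.

```python
import math


def information_bits(column):
    """Information at one position: log2(4) minus the Shannon entropy, in bits."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(4) - entropy


def letter_heights(column):
    """Height of each letter in the stack: its fraction times the position's information."""
    total = information_bits(column)
    return {base: p * total for base, p in column.items()}


# Base fractions counted from the eight sample sequences above.
position_7 = {"A": 8 / 8, "C": 0 / 8, "G": 0 / 8, "T": 0 / 8}  # always A
position_1 = {"A": 5 / 8, "C": 0 / 8, "G": 0 / 8, "T": 3 / 8}  # A or T
position_5 = {"A": 0 / 8, "C": 2 / 8, "G": 4 / 8, "T": 2 / 8}  # poorly conserved

for name, column in [("7", position_7), ("1", position_1), ("5", position_5)]:
    print(f"position {name}: {information_bits(column):.3f} bits")

print(letter_heights(position_5))
```

Position 7 reaches the full 2 bits, while the less predictable positions fall well short of it. Keep in mind that eight sequences is a tiny sample, so these values will not exactly match the heights in the JASPAR logo.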

Image Credit: Sequence logo obtained from the JASPAR database, ID MA0102.2. The provided frequency table was used to construct my sample DNA sequences.

References:

1. Dawkins, R. A Devil's Chaplain. Boston: First Mariner Books, 2003. See essay "The 'Information Challenge,'" reprinted from The Skeptic vol 18, Dec. 1998.

Scitable by Nature Education