Introduction
What is a 'gene'? Surprisingly, in the world of biology and genetics there is no longer a straightforward answer (cf. Pearson, 2006). For instance, Snyder and Gerstein (2003) define a gene as 'a complete chromosomal segment responsible for making a functional product' and then discuss five criteria for identifying genes in the DNA sequence of a genome. The most common view is that a gene should be a piece of nucleic acid. At the onset of molecular biology (Benzer, 1959), the meaning of the term gene was clear: it was the unit of function identified by genetic methods, such as the colours of flowers, the shape of a wing, or the number and shape of bacterial colonies on a Petri dish. This analysis had nothing to do with DNA or RNA, but exclusively with functions. According to current insight in molecular biology, the only meaningful conception of a gene is that of a functional and not of a hereditary unit (see for example Brosius, 2006).
The concept of the cistron (contiguous genomic elements acting in cis, essentially the protein coding sequence) introduced by Benzer (Benzer, 1959, 1961; Benzer and Champe, 1961) and extended by Jacob and Monod (1961) related the gene to an un-interrupted piece of DNA, able to complement a function in a cis/trans test. The equation function = gene = polypeptide = continuous piece of DNA=cistron seemed acceptable in first approximation. However, when several genes were found to constitute an 'Operon' (Jacob and Monod, 1961), representing a programme of gene expression, other problems arose with the introduction of the notion of regulatory genes; for instance, the gene coding for the lac-repressor protein. The latter has to attach to the operator, a DNA sequence placed in cis upstream of the genes in the operon. The operator—is it (part of) a 'gene'? The lac function is based on operator action, thus it is related to the phenotype; but the lac repressor gene is not part of the cistrons controlled by the operator.
With the advent of eukaryotic molecular biology, the problem of defining the gene became even more complicated. In eukaryotes, the tight physical complex linking transcription and translation in bacteria does not exist; the polyribosomes are removed from the DNA, which is stored away in the nucleus. As a consequence, the dimensions of space and time entered gene expression (see Figure 1, inset A; and The Cascade of Regulation (Scherrer and Marcaud, 1968)) and new types of controls had to be considered, in particular at the level of the, by now, autonomous messenger RNA (mRNA). There is an untranslated region (UTR) of about 50–250 nt (Gray and Hentze, 1994; Hess and Duncan, 1996) preceding the coding sequence in the mRNA, and at the end of the mRNA chain the 3'-side UTR which, surprisingly, in some genes (for example, the Prion mRNA) grew to become longer than the coding sequence. Being contiguous and in cis, upstream and downstream of the coding sequence, such a construct of nature did not fit the original concept of the gene.
Figure 1
The Cascade of Regulation (Scherrer, 1967, 1980): The information content of the zygotic genome is gradually reduced to that expressed in a differentiated cell. In Homo sapiens an estimated 500 000 polypeptide-genes are reduced to a few hundred in gradual steps; as few as 3 genes may account for up to 90% protein output, as is the case in red blood cells. The Holo-Cascade (not shown) includes additional steps, leading upstream from the information content of an entire species to that of populations and individuals, and downstream from the polypeptide to the assembled protein including all post-translational modifications (Scherrer, 1980). Under the direction of the holo-genon and holo-transgenon, the genomic information is reduced by DNA rearrangements to that of an individual cell, and then by individual steps to the expression of an individual function, as shown here and outlined in the text. These may include the following (1–2) chromatin modification and activation (protogenon-dependent); (3) transcription and formation of pre-mRNP (pre-genon); (4–6) gradual processing and splicing (pre-genon); (7) export and formation of cytoplasmic mRNP (genon); (8–9) activation (de-repression) of mRNP (genon); (10) translation of mRNA (genon) followed by peptide formation (genon has expired) and gene expression. (A) The spatial transfer of a genomic transcript to RNA processing centres (Iarovaia et al, 2001) and the nuclear periphery and (specific sites in) the cytoplasm (De Conto et al, 1999) induces a delay and, hence, a vector in time. Processing and transport steps may be temporarily interrupted, and lead to considerable delay of expression (up to 30 years in the case of histone mRNA in human oocytes), and the constitution of 'peripheral memories' (Scherrer, 1974, 1980), from where the gene will be created and/or expressed, eventually. These may be in the form of unspliced (fragmented genes) or finally spliced pre-mRNPs, cytoplasmic mRNPs or miRNA complexes. 
(B) The genomic information being gradually reduced along the Cascade, information from the external space of the cell and organism plays a gradually increasing function. Highest at the periphery, concerning for example all cell-surface receptor functions, some external instructions may reach the genomic DNA to bring about a physiologic change. This can be conceived as an 'exo-cascade' as proposed here, 'infiltrating' the 'endo-cascade' of gene expression.
Another problem arose with the observation that mRNA was able to form mRNA–protein (mRNP) complexes. It was found that specific proteins recognise and attach to specific sequence motifs along the mRNA chain, not only in the UTRs but right inside the coding sequence, as could be proven early on for globin mRNAs (Dubochet et al, 1973). This indicated that, superimposed onto the coding sequence, there must exist protein-binding sites, most likely subject to a particular code of interaction. 'Free' mRNPs, found in vivo outside the translation machinery of the polyribosomes, are not translatable in vitro unless most of the mRNP proteins are removed (Civelli et al, 1980); these proteins hence seem to relate to repression. RNPs may also form higher order complexes (cf. review in Dreyfuss et al, 2002), assembled by interaction of (pre-)mRNA with proteins and protein complexes (De Conto et al, 1999) or with cellular structures such as the cytoskeleton (Singer, 1992); they constitute part of the backbone of the nuclear matrix (De Conto et al, 2000; Razin et al, 2004; Ioudinkova et al, 2005).
A fatal blow to the original gene concept came with the observation of giant precursor RNA and their processing (Perry, 1962; Scherrer and Darnell, 1962; Georgiev et al, 1963; Scherrer et al, 1963; Perry et al, 1964) (review in (Scherrer, 2003)). After the experimental observation of pre-mRNA (Scherrer et al, 1970), the discovery of 'splicing' (Berget et al, 1977; Chow et al, 1977) implied the fragmentation of the coding sequence at the genomic level: in most cases, only fragments and not intact genes are stored in DNA. According to the original genetic definition of the gene, and the cistron concept, this indicated that, each time it needs to be expressed, the gene has to be created from its parts encoded in the DNA (Figure 1, inset A; cf. disc. in Scherrer, 1980, 1989; Brosius and Gould, 1992). Interestingly, the occurrence of differential splicing and, as a consequence, the fact that the same DNA domain can contain the information for different genetically identifiable functions, indicated clear separation of the gene as a function from its genomic counterpart in the form of DNA, transmitted from generation to generation. Accordingly, these two matters might also be separated conceptually and in terminology.
The discovery of 'polycistronic' giant RNA (Scherrer et al, 1966) (review in Scherrer, 2003) and the formulation of the pre-mRNA concept (Scherrer et al, 1966; Scherrer and Marcaud, 1968) made it particularly evident that it is impossible to concentrate into a single term all types of information involved in the expression of a single genetic function. To resolve the present confusion, we suggest breaking up the process of gene expression into its basic mechanistic, and thus logical, units, namely gene function on the one hand and the mechanisms of storage and expression on the other. In this process, one is led to propose new concepts and terms that give precise definitions for understanding gene expression in terms of molecular biology, and make it possible to analyse gene storage and expression in terms of information processing. Our intention is to present here a functional and information theoretic analysis of gene storage and expression.
Gene definition, expression and regulation
The basic system of information involved in gene expression, and the one most easily defined, is the coding sequence contained within the mRNA and its counterpart after translation, the nascent polypeptide. This leads back to the original definition of the gene, the unit of genetic function and analysis as used by Mendel and Morgan. It implies the polypeptide chain as the underlying basic unit of function, or its equivalent, the uninterrupted nucleic acid stretch of the coding sequence, the 'cistron' of Benzer (1959). This nucleic acid stretch emerges at the level of the mRNA in eukaryotes, and in most cases is not present at the DNA level as an uninterrupted sequence. In the discussion below, this should be the unique and exclusive definition and meaning of the term 'gene'.
Attached to such a gene is the 'history' of its 'creation' from pieces in the genome before its expression; in other words, along with the transcript comes a programme that secures the formation of the mRNA and its expression in time and space. This programme will be conceptualised as the genon. Within this programme, two kinds of elements of control may be distinguished: (1) the cis-acting signals, which form (oligo-)sequence motifs contained in the same strand of DNA or RNA as the fragments of the coding sequence, and (2) the trans-acting factors which act on the signals placed in cis. Both participate in the programme that secures the generation of the gene, in the cellular space and in time, through the many steps of gene expression.
The genon concept concerns gene expression at large. Some genetic information, however, is only indirectly related to gene expression, like the 3D organisation of DNA and chromatin (see the Unified Matrix Hypothesis (Scherrer, 1989), and the recent observation of '3D-gene regulation' (Spilianakis et al, 2005)). Furthermore, epigenetic mechanisms of gene expression and transmission modify the genon in cis and its precursors at the DNA level; the genon is, thus, a flexible and not a rigid programme. In general, we consider here only regulation directly related to gene expression, leaving out other types of signalling and metabolic controls. These points will be detailed in a more extensive analysis of gene expression and the genon concept (Scherrer and Jost, submitted to Th Biosci).
Basic principles and development of the genon concept
The process of gene expression entails many steps within 'the Cascade of Regulation' (Scherrer and Marcaud, 1968; Scherrer, 1967, 1974, 1980), which reduce the genomic information to that of a gene in a stepwise manner. These include chromatin modification and activation, transcription and RNP formation, processing and transport of the pre-mRNA, formation and export of the mRNA to the cytoplasm, activation (or de-repression) of mRNA and, finally, translation (see Figure 1; Scherrer and Jost, submitted to Th Biosci). The cis programme guiding this process is unique to each distinct mRNA and polypeptide to be formed, although the same signals, in distinct combinations, may be used on the expression pathways of similar or different genes. To express this fact, we suggest the term 'genon' (contraction of 'Gene' and 'operon') for the cis-acting programme, associated with a specific gene at mRNA level but encoded originally in the DNA. The ensemble of trans-acting factors bearing on a genon constitutes the 'transgenon' of an mRNA or an ensemble of mRNA in a cell compartment, a cell or organism. Figure 2 shows the basic propositions of the genon concept.
Figure 2
Coding sequence, gene, genon and transgenon: the amino-acid sequence of a polypeptide represents the gene, as the basis of a function; its equivalent at RNA level is the coding sequence which is inserted into the mRNA and framed by the 5'-end and 3'-end UTRs. In the latter and superimposed onto the coding sequence is the genon, a programme in cis of sequence oligomotifs, eventual binding sites for regulatory proteins (or si/miRNAs—not shown). The holo-Transgenon of a given cell is constituted by all these factors, which eventually will recognise the oligomotifs (empty coloured circles) in the genon in cis. A subset of factors (filled circles) interacting with a specific mRNA constitute the latter's Transgenon (PABP: poly(A)-binding protein). If the gene is a functional RNA, the same formalism applies.
Genons and transgenons are flexible programmes and may be modified without touching the DNA sequence. In cis, the holo-genon is modified when somatic or heritable epigenetic modifications occur, for example by DNA methylation. In somatic cells, the transgenon is constantly adapted by addition and elimination of factors of genomic or environmental origin, and there are heritable protein and RNA factors involved in genetic and epigenetic regulation (for recent reviews see Delaval and Feil, 2004; Peaston and Whitelaw, 2006).
The genon acting in cis
As defined above, the genon represents a regulatory programme superimposed on and attached to a given coding sequence. It is materialised in cis by the ensemble of signals within the mRNA primary and secondary structure that control the expression of the coding sequence it contains. These signals (henceforth referred to as 'oligomotifs') are either superimposed onto the coding sequence or materialise within the mRNA sequence of the 5'- and 3'-side UTR; the mRNA sequence carrying a given programme is, therefore, longer than the coding sequence to which it is attached. In this manner, a specific genon in cis is defined for every gene (Figure 2). Implementation of the genon programme in cis is carried out in trans by nucleic acid-binding proteins (NABPs) on the one side, and by interfering small RNAs (siRNAs, miRNAs) on the other; altogether, these factors constitute the transgenon, the programme in trans (see below) corresponding to a given genon (or mRNA, respectively).
A polycistronic pre-mRNA and/or a full domain transcript (FDT) (Broders et al, 1990) might thus carry a 'Pre-genon', controlling in cis one or several coding sequences. It may be polycistronic (several (fragmented) coding sequences in a row) or polygenic, containing the fragments of several genes to be created by differential splicing. 'Proto-genon' designates the signals of a DNA domain including a specific pre-genon and, in addition, the signals for chromatin modification and transcriptional activation. Each mRNA produced by alternative splicing would thus carry a genon as the remaining elements of its pre-genon. Eventually it will form a distinct (mono-)genon in the mRNA, including all cis-acting signals. At the genomic level, the term 'holo-genon' designates the sum of all (proto-)genons. Figure 3 shows this process from DNA to mRNA expression.
Figure 3
From DNA to mRNA expression: proto-genon, pre-genon and genon: (1) the genomic domain (line A) with exons (light green) and fragments of coding sequences (dark green) as well as intra-genic and inter-genic (not shown) DNA, contains instructions for modelling and activation of chromatin; this constitutes the proto-genon (A'). (2) From these, a pre-mRNA (B) or a FDT with its pre-genon (B') may spring off. The latter may contain gene fragments subject to differential splicing; shown is the case of a pre-mRNA containing the two ORFs 1 and 2. (3) After processing and splicing, the two mRNAs are created with their respective genons and, thereafter, the two gene equivalents, the mRNAs (1) and (2). (4) After translation, peptides 1 and 2 secure two functions. If the gene is a functional RNA, the same formalism applies. Inset: To the proto-genon, pre-genon and genon signals (oligomotifs) carrying distinct instructions for specific steps of processing and gene expression (left), correspond factors from the transgenon (right), in active or inactive states, which may (or may not) implement the corresponding control.
The concept of the genon relates to the cis programme directly, and only indirectly to the transgenon, the system of trans-acting factors. Indeed, each trans-acting factor of protein or RNA nature is the result of a gene and its own genon. Implicit in the genon concept is the fact that there are at least as many genes and genons as distinct open reading frames (ORFs) encoded in the genome. Accordingly, the 36 000 or so genomic domains identified within the human genome project (cf. Venter et al, 2001; for more recent estimates, see Pennisi, 2003) would encode about 500 000 genes producing as many polypeptides. The genomic domains emerging, possibly, from sequence data correspond, by order of magnitude, to the highest molecular weight transcripts (FDTs) in eukaryotes and to the DNA in loops of lampbrush chromosomes, or in the chromosome bands of polytene chromosomes of Diptera. These were identified as units of transcription, and in Sciaridae as units of local DNA amplification, and cytogenetically as units of meiotic recombination (cf. discussion in Scherrer and Marcaud, 1968; Scherrer, 1980, 1989; Scherrer and Jost, submitted to Th Biosci).
The genon and its precursors act at the transcriptional and post-transcriptional levels and lose their function with mRNA translation and subsequent degradation. Therefore, we will not consider here the downstream programmes governing gene expression post-translationally, or the catabolic aspect of protein homeostasis. However, most obviously, gene expression implies control of amounts as well as types of gene products and, therefore, RNA and protein degradation as well as biogenesis. This is achievable only by interplay and coordination of protein biosynthesis and degradation, as conceivable for example, within the prosome–proteasome system (Scherrer and Bey, 1994).
The transgenon
The genon in cis as outlined above is materialised by the ensemble of factor binding sites ('oligomotifs') within an individual mRNA sequence. These sites are recognised by protein or RNA factors supplied by the programme in trans. These are available—or not—within the holo-transgenon of a given cell, nucleus or cytoplasm. We exclude here all mechanisms directly related to constitutive and basic protein biosynthesis within the frame of the genetic code, such as the ribosome and the basic tRNA machinery.
Regulation of transcription, and hence of programmes of differentiation and physiological change, occurs largely under the influence of cell-external factors, constituting some kind of Exo-cascade (see Figure 1, inset B), which act either directly or via the transgenon.
The genon of an mRNA is, thus, plunged into the pool of trans-acting factors recognised by the receptor oligomotifs in cis, 'fishing out' its specific transgenon. The presence of these factors is thus crucial for execution of the expression programme encoded in the genon in cis. Being automatically picked up by the oligomotifs in cis, these factors have a discriminative regulatory function. Their presence or absence controls the implementation of the cis programme; furthermore, they may be present in an active or inactive state. As proteins are capable of integrating many types of input, small molecular weight agents may influence factor–signal interactions either directly or as allosteric effectors.
Nucleic acid-binding proteins as carriers of the transgenon
A major part of the transgenon is constituted by NABPs, which are produced from cP-genes by the normal mechanisms of gene expression and regulation by protein biosynthesis. All types of RNA in the cell are covered by proteins (1:3 ratio in mRNPs). In the case of mRNA, it was shown by electron microscopy (Dubochet et al, 1973) that proteins are aligned along the entire length of the RNA molecules (and this is thus likely also for pre-mRNA), protecting specific oligomotifs from degradation by RNase (Goldenberg et al, 1979). Only in rare cases is the oligonucleotide sequence that binds a given regulatory protein known; in addition to the poly(A)-binding protein (PABP), one might mention, for example, the IRE-BP (iron response element binding protein), a protein that binds an oligomotif in the 5'-side or 3'-side UTR of the mRNAs for ferritin and the transferrin receptor, respectively (Thomson et al, 1999).
These observations indicate that there must be a 'code' governing the interaction of a limited number of NABPs in chromatin and mRNPs, which, in general, are specifically DNA- or RNA-binding proteins. However, relatively new data have confirmed the earlier observation that a given protein may bind both DNA and RNA. This was originally observed for the large T-antigen of SV40 and polyoma virus (Darlix et al, 1984) and more recently confirmed for a series of DNA-binding MAR proteins, long identified as hnRNP proteins (von Kries et al, 1994).
Nucleic acids as carriers of the transgenon within RNAi
As mentioned above, RNAi represents another mechanism of post-transcriptional regulation, acting on pre-mRNA and mRNA by interacting with the genon in cis. Because this topic is presently undergoing rapid development, we defer this discussion to our forthcoming more extensive analysis (Scherrer and Jost, submitted to Th Biosci).
Mathematical analysis of genetic information and gene expression
General considerations
We will go beyond the classical application of information theory to molecular biology by defining a mathematical framework that distinguishes the mere coding information from process and product information contained in the genon and thus naturally includes gene expression and regulation.
Thus, we will approach here the mathematical analysis of the genon, that is, the programme governing the expression of a gene. The analysis will start with the programme in cis, that is, the ensemble of genon-related signals encoded in DNA and RNA, and then be extended to the trans programme, that is, the rather heterogeneous ensemble of factors, provided either by the genome or the environment of cell and organism, which interact with the signals in cis. The decision-making processes may be analysed in a different manner, according to whether the holo-genon (all proto-genons encoded in a genome) of an organism or specific cell is considered, or the genon of an individual mRNA. Analysing gene expression by starting at the DNA level, the first decision-making process logically concerns the 'immersion' of a holo-genon in cis into the population of factors provided by the transgenon of a given cell, as occurs during the dispersion of sperm DNA (devoid of any attached protein) into the ooplasm of an egg after fertilisation. It is simpler, however, to start from the other end, and first consider the immersion of a genon, as present in a given mRNA, into the cytoplasm of a given specialised cell. In a given cytoplasm, there are only about 500–1000 RNA-binding protein factors to which a given mRNA may be exposed. This approach reduces the selection process to the interaction of such a population of factors with the dozen or so signals in a given genon. In a first step, we consider a theoretically maximal trans programme, assuming that all possible trans-factors within a given genome were available to an mRNA; the decision process in this case would be reduced to the 'fishing-out' by a given mRNA of the factors corresponding to its genon in cis (see Figure 2). The constitution of the trans programme for a specific cell might be analysed in a further step.
Information theoretic aspects of gene storage and expression
Information theory, as developed by Shannon, is concerned with the reduction in uncertainty obtained by receiving a particular message m drawn from some ensemble W when, before actually receiving it, one only knows the probabilities $p_m$ of the messages m (which satisfy the normalization $\sum_{m \in W} p_m = 1$). These probabilities may be constructed as relative frequencies obtained from counting the messages received in the past. The information gained or, equivalently, the uncertainty reduced by receiving the actual message is then quantified by the entropy
$$I = -\sum_{m \in W} p_m \log_2 p_m \tag{1}$$
The simplest nontrivial situation arises when we have only two messages, each of them occurring with equal probability 1/2. In that case, $I=1$ (bit). Messages with probability 0 do not contribute, because $0 \log 0=0$ by convention, and when one message m is certain to occur, that is $p_m=1$, then we gain no information ($1 \log 1=0$) by receiving it because we already knew it beforehand.
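As a minimal sketch of this entropy, the following Python function evaluates it for a list of message probabilities. The function name shannon_entropy is our own choice for illustration and does not come from the analysis itself; zero-probability messages are skipped, in line with the convention 0 log 0 = 0.

```python
import math


def shannon_entropy(probabilities):
    """Return the Shannon entropy, in bits, of a list of message probabilities."""
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Messages with probability 0 contribute nothing (0 log 0 = 0).
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


print(shannon_entropy([0.5, 0.5]))  # two equally likely messages: 1.0 bit
print(shannon_entropy([1.0, 0.0]))  # one certain message: 0.0 bits
```

The two calls reproduce the two limiting cases discussed in the text: one bit for a fair binary choice, and no information when the message is known in advance.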
In molecular biology as well as in other applications of information theory, the messages are often sequences $S=(i_1, i_2, i_3, \ldots, i_m)$ composed of symbols i drawn from some alphabet A. Each symbol again occurs with some probability $p_i$, and when the symbols in the sequence S are chosen independently of each other, the probability of that sequence is
$$p(S) = \prod_{j=1}^{m} p_{i_j} \tag{2}$$
As such, information theory is a formal tool, and for applying it, we need to specify the ensemble of messages or the alphabet of symbols. In molecular biology, the symbols are either nucleotides or amino acids. The former applies to DNA sequences, which are composed of four nucleotides, A, C, G and T. When each of them occurs with relative frequency $p_i$ ($i=A,C,G,T$), each position contributes an information of
$$I_{nuc} = -\sum_{i \in \{A,C,G,T\}} p_i \log_2 p_i \tag{3}$$
In particular, when all $p_i=1/4$, this information is 2 (bits). For other values of the $p_i$ (still satisfying the normalization $\sum_{i \in \{A,C,G,T\}} p_i = 1$), $I_{nuc}$ is smaller. When all positions in a sequence of length N are independent, the sequence information is $I_{seq}=N I_{nuc}$. Sequence correlations, however, will decrease that information.
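The per-position and per-sequence quantities can be illustrated with a short Python sketch that estimates the nucleotide frequencies by counting, as suggested above for message probabilities in general. The function names and the two example sequences are hypothetical and chosen only for illustration; the sequence information is computed under the stated assumption that positions are independent.

```python
import math
from collections import Counter


def information_per_nucleotide(sequence):
    """Estimate the information per position (bits) from the base frequencies of a DNA string."""
    counts = Counter(sequence.upper())
    unknown = set(counts) - set("ACGT")
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    total = sum(counts.values())
    # Relative frequency of each base is c / total; log2(total / c) = -log2(frequency).
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def sequence_information(sequence):
    """Sequence information N * I_nuc, valid only if positions are independent."""
    return len(sequence) * information_per_nucleotide(sequence)


balanced = "ACGT" * 5        # hypothetical sequence, all four bases equally frequent
biased = "AAAAAAAAAAAAACGT"  # hypothetical sequence dominated by A

print(information_per_nucleotide(balanced))  # 2.0 bits per position
print(information_per_nucleotide(biased))    # fewer than 2 bits per position
print(sequence_information(balanced))        # 20 positions * 2 bits = 40.0
```

Because the estimate ignores correlations between positions, it is an upper bound on the information of a real sequence with the same base composition.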
The other type of sequences are polypeptides, which are composed of amino acids. There exist 20 different amino acids, and we denote the relative frequency of an amino acid $\alpha$ by $p_\alpha$. Thus, the average information required for specifying an amino acid is
$$I_{aa} = -\sum_{\alpha} p_\alpha \log_2 p_\alpha \tag{4}$$
When all these frequencies are equal, $I_{aa}=\log_2 20$. Otherwise, the value of $I_{aa}$ is again smaller.
Owing to the degeneracy of the genetic code, which leads to redundant coding for amino acids, the information needed to specify an amino acid is smaller than that contained in a triplet ($\log_2 20 \approx 4.32 < \log_2 64 = 6$).
Therefore, when we pass from a coding sequence in the DNA to the polypeptide that it codes, we see a reduction of information.
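The size of this reduction can be sketched numerically. The Python example below uses the codon degeneracy of the standard genetic code and makes two simplifying assumptions of our own: every one of the 64 codons is taken to be equally likely, and the three stop codons are treated as a 21st outcome alongside the 20 amino acids. Real codon usage is not uniform, so the numbers are illustrative only.

```python
import math

# Number of codons per amino acid in the standard genetic code
# (three-letter abbreviations); "Stop" collects the three stop codons.
CODON_DEGENERACY = {
    "Leu": 6, "Ser": 6, "Arg": 6,
    "Ala": 4, "Gly": 4, "Pro": 4, "Thr": 4, "Val": 4,
    "Ile": 3,
    "Asn": 2, "Asp": 2, "Cys": 2, "Gln": 2, "Glu": 2,
    "His": 2, "Lys": 2, "Phe": 2, "Tyr": 2,
    "Met": 1, "Trp": 1,
    "Stop": 3,
}

total_codons = sum(CODON_DEGENERACY.values())  # 64 triplets in total

# Assumption: all codons equally likely, so a triplet carries log2(64) bits.
triplet_information = math.log2(total_codons)

# Entropy of the induced distribution over translation outcomes.
outcome_information = sum(
    (n / total_codons) * math.log2(total_codons / n)
    for n in CODON_DEGENERACY.values()
)

print(f"triplet: {triplet_information:.2f} bits")
print(f"translated outcome: {outcome_information:.2f} bits")
print(f"lost to degeneracy: {triplet_information - outcome_information:.2f} bits")
```

Under these assumptions the translated outcome carries less information than even the uniform bound over 21 outcomes, because the degeneracy is unevenly distributed among the amino acids.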
This is the standard application of information theory to molecular biology, but for applying the concept of the genon, we need to turn things around. Instead of the backward perspective where one starts with a polypeptide and quantifies the uncertainty about the coding sequence in the DNA or RNA, we start with a genomic domain including a variety of gene fragments and look at the ensemble of functional products derived from it (a collection of polypeptides or functional RNAs, grouped according to type). As the case of functional RNA is similar, for simplicity we here only consider the situation where those functional products are polypeptides. One such coding domain can be transcribed and translated several times, and during the expression process, mechanisms like alternative splicing or translational frameshift may even lead to chemically different polypeptides containing amino acids derived from a single coding sequence. Thus, we have a single coding domain S at DNA or RNA level, but an ensemble of products x derived from it. Let the relative frequency of x in that ensemble be qx. Therefore, the uncertainty of the result of the expression of S is given by
$$I(S) = -\sum_{x} q_x \log_2 q_x \tag{5}$$

The important point we want to make here is that this information I(S) is not contained in the sequence S, but is rather provided by the (proto-, pre-)genon that accompanies it on the expression pathway and controls in which polypeptide it will end up. Therefore,
$$I_{genon} = I(S) = -\sum_{x} q_x \log_2 q_x \tag{6}$$
A finer analysis evaluates the information contributed by the different steps in the regulation of the expression process. At each step, we have certain binding sites in cis, and we have certain trans factors that bind or could possibly bind there. At the transcription and RNA level, we see mechanisms of transcription activation, RNA processing, promotion, enhancement or repression. Most of them are affected by nucleotides outside the coding sequence itself, but also the coding region can provide specific protein-binding sites. Each of these yields information about the ensemble of products derived from the coding region that is not contained in the nucleotide sequence information obeying the genetic code. More precisely, the information that counts here is not about the identity of a nucleotide or an amino acid derived from it, but about the relative frequency of the transcription and generation of a particular type of coding sequence. This then contributes to the determination of the types and numbers of functional products derived from the DNA coding region under consideration. Whereas the selection of proteins that can bind at those regulatory sites is determined by the chemical identity of their nucleotides and therefore represents a contribution from the genon in cis, the availability or the relative frequency of different proteins competing for a given binding site counts as a contribution from the transgenon. At a later stage, the mechanism of alternative splicing selects between different compositions of the final functional product. Again, that selection yields a reduction of uncertainty, that is, an information because before the splicing process, we do not know which alternative will be chosen. And that information is provided by the genon because the number of possibilities cannot be directly read from the chemical composition of individual nucleotides without knowing the context. In other words, it is a process and not a product information. 
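To make the notion of process information concrete, the following Python sketch computes the uncertainty over the ensemble of products derived from a single coding domain, with the relative frequencies of the products estimated from counts. The function name, the isoform labels and the counts are hypothetical and serve only as an illustration of the definition.

```python
import math


def genon_information(product_counts):
    """Entropy (bits) of the ensemble of products derived from one coding domain.

    product_counts maps each product x to the number of times it was observed;
    the relative frequency of x is its count divided by the total.
    """
    total = sum(product_counts.values())
    if total <= 0:
        raise ValueError("at least one product must be observed")
    return sum(
        (c / total) * math.log2(total / c)
        for c in product_counts.values()
        if c > 0
    )


# Hypothetical domain whose transcript is spliced into three isoforms.
alternatively_spliced = {"isoform_a": 2, "isoform_b": 1, "isoform_c": 1}
# Hypothetical domain that always yields the same polypeptide.
single_product = {"polypeptide": 7}

print(genon_information(alternatively_spliced))  # 1.5 bits
print(genon_information(single_product))         # 0.0 bits
```

In this reading, a domain with a single possible product requires no additional selection, whereas a domain subject to alternative splicing requires information that is not read off the coding nucleotides themselves.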
In a similar manner, we can look at all the different steps in the cascade of regulation. Whereas nucleotide identities specify a final amino-acid product, on the expression pathway, here the important information is which proteins or RNAs (in the case of RNAi) bind to a given mRNA and in which manner they affect the expression process. Again, this is part of the information carried by the genon. As before, the presence or absence of those regulatory proteins or RNAs has to be counted as information provided by the transgenon.
With our information theoretic analysis, we can also compare different DNA segments $S$ and $S'$, where the smaller segment $S$ is contained in the longer one $S'$. Here, from the backward point of view of product information, $S'$ contains more information when it codes for a longer polypeptide than $S$. From our forward analysis of process information, however, the genon information in $S'$ is higher when the end products that can be derived from it are fewer, that is, more specific than the ones derived from the shorter segment $S$.
As argued, equation (6) expresses the information provided by the genon, but perhaps not contained in S. Of course, the genon is at least partly superposed on S. But what, then, is the information contained in S itself about the end product derived from it? It turns out that in the present context, this is somewhat more arbitrary to specify. The question means to what extent the possibilities about abstractly possible polypeptides are reduced when we know the sequence S. In order to make this precise and quantifiable, we need to select an ensemble W of polypeptides in which S could possibly be represented. Here, S is conceived as any coding sequence, that is, one whose nucleotide composition is unknown. By determining that composition, we then gain information. But at this point, we are not interested in that chemical composition, but in the identity of the possible end products derived from it. Therefore, determining the nucleotide composition contributes only indirectly to the information desired here. Thus, we need to specify the ensemble of possible end products. This could be the ensemble of all biochemically possible polypeptides—an astronomical number—or the ones that can be derived from the genome in question—about 500 000 in the case of the human genome— or those that actually are made in the given cell—perhaps only several hundred. Of course, there exist other possible choices. In any case, within such an ensemble, each polypeptide x has a relative frequency px and the uncertainty about x then is
$$I_W = -\sum_{x \in W} p_x \log_2 p_x \tag{7}$$

When only a single such x can be derived from S, then that is the information contained in S, because this is the amount of uncertainty about the end product that is reduced by knowing S. Our point, however, is that S does not completely specify that end product; rather, the additional information $I_{genon}$ from equation (6) is needed. Therefore, the information provided by S is only
$$I_S = -\sum_{x \in W} p_x \log_2 p_x - I_{genon} \tag{8}$$
According to the cascade principle, the choice of final products is divided into multiple steps. Along these steps, the cisgenon is reduced by RNA processing and the transgenon modified according to cell compartment and physiological context. The relative information from cis and trans can be expressed in similar terms for each of these steps. The segmentation of this process facilitates individual selection of products as there is less uncertainty as the number of possibilities is reduced in each step.
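The stepwise division of the selection can be illustrated with the chain rule of entropy, a standard identity of information theory that we use here as our own illustration and not as part of the formalism above. In the Python sketch below, a hypothetical two-step cascade first selects a splice variant and then the fate of the resulting mRNA; all labels and probabilities are invented for the example.

```python
import math


def entropy(distribution):
    """Shannon entropy (bits) of a mapping from outcomes to probabilities."""
    return sum(p * math.log2(1.0 / p) for p in distribution.values() if p > 0)


# Hypothetical two-step cascade for one transcript.
# Step 1: which splice variant is produced.
splicing = {"variant_a": 0.5, "variant_b": 0.5}
# Step 2: fate of the resulting mRNA, conditional on the variant.
fate_given_variant = {
    "variant_a": {"translated": 0.5, "stored": 0.5},
    "variant_b": {"translated": 1.0},
}

# Joint distribution over complete pathways (variant, fate).
pathways = {
    (variant, fate): p_variant * p_fate
    for variant, p_variant in splicing.items()
    for fate, p_fate in fate_given_variant[variant].items()
}

total = entropy(pathways)
step1 = entropy(splicing)
# Average uncertainty of step 2, weighted by the outcome of step 1.
step2 = sum(p * entropy(fate_given_variant[v]) for v, p in splicing.items())

print(total, step1, step2)  # the total equals the sum of the two step contributions
```

Each step resolves only part of the overall uncertainty, which is one way to read the statement that segmenting the process reduces the number of possibilities to be decided at any single step.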
Conclusion
In conclusion, we have provided an information theoretic analysis of the information provided by coding sequences and genons. We have distinguished the sequence information, determined by the nucleotide or amino-acid frequencies and the sequence correlations, the process information contributed by the genon according to the types and numbers of end products that can be derived from a given sequence, and the product information, expressing how much the number of possibilities for a polypeptide is reduced when we know the sequence. The difference between the numbers of possibilities when one does not know or knows the sequence again is the contribution of the genon. This should facilitate a deeper formal understanding of the respective contributions of coding domains and genons. The condition 'sine qua non' to carry out this analysis is to assign a restrictive meaning to the term 'gene' and separate this information from the process information of the genon.
