Splicing and dicing the human genome

Scientists begin to unravel the splicing code

What separates the genomes of simple organisms like sea anemones and jellyfish from humans? Humans have approximately the same number of protein-coding genes as these lowly creatures,¹ yet we are much more complex organisms. Ignoring the spiritual aspects of humanity, this complexity difference must be coded within our genomes, but where? Since we share many genes with many simpler organisms, the answer does not lie in gene content alone. Rather, the differences are in the non-coding portions of the genome (the so-called “junk DNA”²) and in the way the genes are used to create proteins.

Several decades ago, the “one gene-one enzyme” hypothesis was in vogue. It seemed straightforward that a single protein gene coded for a single protein. In prokaryotic organisms (bacteria), this was easy to show. The known bacterial genes had a defined starting and stopping place and the DNA letters in between spelled out a discrete amino acid sequence. The eukaryotes (organisms with a nucleus; everything from yeast, to plants, to humans) do not have a simple gene structure. Our protein genes are broken up into a series of “exons” (the parts that code for protein) and “introns” (non-coding intervening sequences). To make a protein, the gene is first transcribed into RNA, then the introns are spliced out, the exons are stitched together, and the remainder is translated into protein. Even though complex, the one gene-one enzyme hypothesis was still applied to eukaryotic protein genes.
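The transcribe, splice, and translate steps described above can be sketched in a few lines of Python. This is only a toy illustration: the DNA sequence, the exon coordinates, and the deliberately tiny codon table are all made up for the example, and real splicing machinery is far more involved.

```python
# Toy illustration of transcribe -> splice -> translate.
# The sequence, exon coordinates and partial codon table are hypothetical.

# Partial codon table (RNA codon -> one-letter amino acid); "*" marks stop.
CODON_TABLE = {
    "AUG": "M", "GCU": "A", "AAA": "K", "GGU": "G",
    "UUU": "F", "GAA": "E", "UAA": "*",
}


def transcribe(dna: str) -> str:
    """Return the RNA copy of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exon intervals (0-based, end-exclusive) and join them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna: str) -> str:
    """Read codons from position 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


# Layout of the toy gene: exon 1, then an intron, then exon 2.
GENE = "ATGGCT" + "GTAAGTTTTCAG" + "AAAGGTTAA"
EXONS = [(0, 6), (18, 27)]

mrna = splice(transcribe(GENE), EXONS)
protein = translate(mrna)
print(mrna, protein)
```

The intron in this toy gene begins with GT and ends with AG, which mirrors the kind of head and tail signals that mark where a real intron is cut out.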

Over time, however, it was realized that life was not so simple, especially for the eukaryotes. The one gene-one enzyme hypothesis was particularly troubling for the higher (more complex) eukaryotes. For example, the approximately 20,000–25,000 protein-coding genes in the human genome³ are used to create 100,000–300,000 distinct proteins (the actual number is uncertain). The low number of genes in the human genome was troubling for several reasons.⁴ First, this means that we did not have that many more genes than organisms much simpler than us. Second, we needed a way to create many proteins from few genes and nobody knew how this could be done on such a large scale. And third, the complexity of the genomic computer program ratcheted up to even more uncomfortable levels for those who thought we arose through random chance.
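To see what these figures imply, the ranges quoted above can be turned into an average number of proteins per gene. The sketch below uses only the estimates given in the text; the function name is an illustrative choice.

```python
GENES = (20_000, 25_000)        # protein-coding gene estimates from the text
PROTEINS = (100_000, 300_000)   # distinct protein estimates from the text


def proteins_per_gene_range(genes, proteins):
    """Return the (lowest, highest) average number of proteins per gene."""
    lowest = min(proteins) / max(genes)
    highest = max(proteins) / min(genes)
    return lowest, highest


low, high = proteins_per_gene_range(GENES, PROTEINS)
print(f"{low:.0f} to {high:.0f} proteins per gene on average")
```

On these numbers, each gene would have to yield somewhere between 4 and 15 proteins on average.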

Even before the Human Genome Project⁵ was complete, we knew that some proteins are manufactured through a process called “alternate splicing”, where exons from different locations in the genome are combined to create many different proteins. From the ENCODE project,⁶ we learned that alternate splicing is so pervasive that the definition of the word “gene” is currently under debate.⁷ Thus, the one gene-one enzyme hypothesis turned out to be a gross oversimplification. However, the word and the concept of a “gene” are so useful that for the rest of this article I will refer to “genes” in the classic sense: a contiguous stretch of DNA with a starting and ending location and a set of introns and exons that could potentially be transcribed, spliced, and translated into a single protein. Each gene, however, is made of parts that can be recombined with parts from other genes in different locations in the genome to create proteins not coded by any specific gene.
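A rough feel for how quickly the number of possible products grows can be had from a short Python sketch. It models only the simplest case, in which certain internal exons of a single gene are either included or skipped while exon order is preserved; it does not attempt to model exons being combined across different genes. The exon names and the choice of which exons are optional are hypothetical.

```python
from itertools import combinations


def isoforms(exons: list[str], optional: set[int]) -> list[str]:
    """Enumerate transcripts made by including or skipping optional exons.

    Exons whose index is not in `optional` are always kept, and the
    original exon order is preserved in every transcript.
    """
    optional_sorted = sorted(optional)
    results = []
    for k in range(len(optional_sorted) + 1):
        for skipped in combinations(optional_sorted, k):
            kept = [e for i, e in enumerate(exons) if i not in skipped]
            results.append("-".join(kept))
    return results


# Hypothetical five-exon gene in which the three internal exons are optional.
EXONS = ["E1", "E2", "E3", "E4", "E5"]
variants = isoforms(EXONS, optional={1, 2, 3})
print(len(variants), variants)
```

With three optional exons there are already eight distinct transcripts, and each additional optional exon doubles the count.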

Alternate splicing is a brilliant design concept that allows for a streamlined genetic program that takes up a fraction of the space compared to a program that coded for each protein independently. But this added complexity comes at a price. It has been conservatively estimated that each intron adds the same amount of complexity as approximately 30 additional DNA letters.⁸ Thus, the “mutation target” for a gene is increased for each intron added. Consider that the average protein-coding gene has 7–10 introns and that the total length of introns is often longer than the total length of protein-coding DNA, and one can see why this is a problem. It takes a lot to maintain such a system and the complexity makes it difficult for naturalistic theories of origins to explain. In fact, a sizeable proportion of human genetic disease has been attributed to mutations within intron-exon splice sites.⁹ Introns are typically included in the junk DNA category, but they have specific sequences at the head and tail ends that tell the splicing mechanism where to cut, etc., so they are not without function. (Exons also have splice signals at their ends. Thus, some of the information for splicing out the introns is found within the protein-coding portion of the genome. The protein-coding sections code for both protein sequence and splicing patterns at the same time!)
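The size of the added mutation target follows directly from the two figures quoted above. A minimal calculation, using only those numbers (the variable names are illustrative):

```python
LETTERS_PER_INTRON = 30     # equivalent extra DNA letters per intron, from the text
INTRONS_PER_GENE = (7, 10)  # typical range of introns per gene, from the text

# Extra mutation target per gene, in equivalent DNA letters.
extra_target = tuple(n * LETTERS_PER_INTRON for n in INTRONS_PER_GENE)
print(f"{extra_target[0]} to {extra_target[1]} additional letters per gene")
```

That is, an average gene carries the equivalent of roughly 210 to 300 extra letters that must be kept intact simply because it is split into exons and introns.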

The ENCODE project made the significant discovery that nearly all of the genome was turned into RNA at some point in the life of a cell and that multiple overlapping RNAs were often created from the same stretch of DNA. This was a tremendous blow to junk DNA theorists.¹⁰ However, perhaps more importantly, the ENCODE results also documented an amazing amount of alternate splicing. So, here we were, knowing that a huge portion of the genome is active and that the protein-coding portions were being used in complex combinations, but we still did not know how it all came together. Because of this, scientists have been looking for a “splicing code” within the genome that controls the slicing and dicing of the protein genes. This splicing code must account for 1) the complex combinations of exons needed to create hundreds of thousands of proteins from tens of thousands of protein genes, 2) the variation in splicing from cell to cell needed to account for the different proteins expressed in different cell types, and 3) changes in splicing patterns over time as the organism proceeds from fertilized egg to adult (since not all genes are active at all stages in the life cycle). All this information must be coded in the genome, but it also cannot interfere with the protein-coding domains. Thus, most of this information must reside within the introns and in the spaces between genes.

A paper recently appeared in Nature in which the authors claimed to have discovered the beginning of the splicing code. What they found is a marvel of complexity. Science labs across the world have been generating tremendous amounts of data, and the authors were able to capitalize on this new knowledge in a massive data mining exercise. Specifically, vast databases have been compiled that tell us which genes are active in different cell lines and at different stages of development. We also know of many DNA-binding factors and their specific sequence targets (usually a short string of very precise letters that are targeted by proteins with whimsical names like “Star”, “Nova”, and “Quaking-like”). With this knowledge, the authors were able to approach the issue statistically and document significant features that help to control alternate splicing. They found many “motifs” (short DNA words of 5–10 letters each) before and after many exons that were strongly associated with different cell types. In all, they could explain 60% of the alternate splicing patterns found in the human genome just by the presence or absence of these motifs. Many of the motifs were known previously and are sites for known DNA-binding proteins. Many other motifs were new to science.
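The basic idea of looking for short motifs before and after an exon can be sketched as follows. This is not the method used in the paper, which was a much larger statistical exercise; it is only a minimal counting example. The sequence, the exon coordinates, the two motifs, and the default window size are all arbitrary placeholders chosen for illustration.

```python
def flanking_regions(seq, exon_start, exon_end, window=300):
    """Return up to `window` letters before and after an exon.

    Coordinates are 0-based and end-exclusive; `window` is an arbitrary default.
    """
    upstream = seq[max(0, exon_start - window):exon_start]
    downstream = seq[exon_end:exon_end + window]
    return upstream, downstream


def count_motif(region, motif):
    """Count occurrences of `motif` in `region`, allowing overlaps."""
    count = 0
    pos = region.find(motif)
    while pos != -1:
        count += 1
        pos = region.find(motif, pos + 1)
    return count


def motif_features(seq, exon_start, exon_end, motifs, window=300):
    """Tabulate how often each motif occurs upstream and downstream of an exon."""
    up, down = flanking_regions(seq, exon_start, exon_end, window)
    return {
        m: {"upstream": count_motif(up, m), "downstream": count_motif(down, m)}
        for m in motifs
    }


# Hypothetical sequence: 10 flanking letters, a 12-letter exon, 16 flanking letters.
SEQ = "AATGCATGCC" + "ATGGCTAAAGGT" + "GTCTCTCTAATGCATG"
MOTIFS = ["TGCATG", "CTCTCT"]  # placeholder motifs, not taken from the paper

features = motif_features(SEQ, exon_start=10, exon_end=22, motifs=MOTIFS)
print(features)
```

A table like this, built for every exon and compared across cell types, is the kind of raw material a statistical search for splicing-associated motifs could start from.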

The median number of tissue-specific motifs associated with splicing, per exon, ranged from 12 for the central nervous system to 19 for embryo.¹¹ There were additional tissue-independent features associated with most or all exons and additional and abundant short motifs that were not considered in the above counts. This means the splicing code is complex and that complex combinations of instructions are needed to control how the many exons combine to produce the multitude of proteins found in the human body.

They also discovered features related to splicing much farther away from the protein-coding regions than they expected. Because of technical limitations, most studies on transcription regulation have historically focused on a few dozen letters immediately upstream or downstream of a target sequence. Here, they document features much further into non-coding regions than previously known (up to 300 letters away). Thus, even more junk DNA has been subsumed into the functional DNA category!

But this is only the beginning. They have only scratched the surface and have already discovered amazing complexity. They only managed a prediction accuracy of 60%. Therefore, much remains to be discovered. Where is the missing information? Perhaps it will be found deeper into the non-coding DNA. Perhaps, because they did not consider the 3-D architecture of the DNA within the nucleus, additional features may be discovered much farther away or even on different chromosomes! The possibilities are endless and we will certainly update you as more is learned.

There is one final implication of this work I would like to discuss. There are many “pseudogenes” in the genome that look like functional genes but have “mutations” that prevent them from being turned into proteins. The presence of pseudogenes has been an enigma since their discovery, but the idea has generally been used to attack creationists and other advocates of design. I believe the arguments are spurious¹² and we have written much about them in prior articles.¹³ Even though functions have been found for many pseudogenes, it is true that, if transcribed and spliced, a pseudogene cannot be translated into a protein. However, now that we are aware of alternate splicing, future work may show that many of the pseudogene exons are incorporated into functional proteins. If so, the entire pseudogene argument will collapse like a house of cards. But, only time will tell.

For now, let us marvel at the brilliantly engineered human genome. God wrote a genetic computer program that is, to date, unsurpassed by any human technology. The wisdom and foresight that went into it is nothing short of stunning. He engineered a string of DNA as long as a person is tall that could withstand thousands of errors (mutations), adapt to changing environments (through self-modifying code that turns different genes on and off, depending on conditions), and that can be packed into a microscopic cell without forming knots! Now we learn that his program is a wonder of data compression and efficiency. It is more sophisticated than anything we have ever contemplated.

Published: 29 June 2010

References

1. Putnam, N.H., et al., Sea anemone genome reveals ancestral Eumetazoan gene repertoire and genomic organization, Science 317:86–94.
2. Carter, R.W., The slow, painful death of junk DNA.
3. Pennisi, E., Gene counters struggle to get the right answer, Science 301:1040–1041, 2003.
4. Claverie, J., Gene number. What if there are only 30,000 human genes? Science 291:1255–1257, 2001.
5. International Human Genome Sequencing Consortium, Initial sequence and analysis of the human genome, Nature 409(6822):860–921, 2001.
6. ENCODE Project Consortium, Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, Nature 447:799–816.
7. Gerstein, M.B., What is a gene, post-ENCODE? History and updated definition, Genome Research 17:669–681, 2007.
8. Lynch, M., Rate, molecular spectrum, and consequences of human mutation, Proceedings of the National Academy of Sciences USA 107(3):961–968, 2010.
9. Barash, Y., et al., Deciphering the splicing code, Nature 465:53–59, 2010.
10. Williams, A., Astonishing DNA complexity update.
11. The use of human embryo data is highly disturbing to me, but this article is not about the ethical, moral, or spiritual ramifications of the “brave” new world of modern science so I will refrain from further comment.
12. The Great Dothan Debate.
13. For a list of articles on pseudogenes, see the Junk DNA section of the Vestigial Organs Questions and Answers page.