Splicing and dicing the human genome
Scientists begin to unravel the splicing code
by Robert W. Carter
Published: 1 July 2010 (GMT+10)
What separates the genomes of simple organisms like sea anemones and jellyfish from
humans? Humans have approximately the same number of protein coding genes as these
lowly creatures,1 yet we
are much more complex organisms. Ignoring the spiritual aspects of humanity, this
complexity difference must be coded within our genomes, but where? Since we share
many genes with many simpler organisms, the answer does not lie in gene content
alone. Rather, the differences are in the non-coding portions of the genome (the
so-called “junk DNA”2)
and in the way the genes are used to create proteins.
Several decades ago, the “one gene-one enzyme” hypothesis was in vogue.
It seemed straightforward that a single protein gene coded for a single protein.
In prokaryotic organisms (bacteria), this was easy to show. The known bacterial
genes had a defined starting and stopping place and the DNA letters in between spelled
out a discrete amino acid sequence. The eukaryotes (organisms with a nucleus; everything
from yeast, to plants, to humans) do not have a simple gene structure. Our protein
genes are broken up into a series of “exons” (the parts that code for
protein) and “introns” (non-coding intervening sequences). To make a
protein, the gene is first transcribed into RNA, then the introns are spliced out,
the exons are stitched together, and the resulting message is translated into protein. Even
though this process is more complex, the one gene-one enzyme hypothesis was still applied to eukaryotic
protein genes.
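The transcribe, splice, and translate steps described above can be sketched in a few lines of Python. This is only a toy illustration: the gene sequence, the exon coordinates, and the four-entry codon table are made up for the example, and real splicing depends on many signals that are ignored here.

```python
def transcribe(dna: str) -> str:
    """Transcribe a coding-strand DNA string into RNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exon intervals (0-based, end-exclusive) and join them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Minimal codon table covering only the codons used in this toy example.
CODON_TABLE = {"AUG": "M", "GCU": "A", "AAA": "K", "UAA": "*"}


def translate(mrna: str) -> str:
    """Translate codon by codon until a stop codon (*) is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical gene: exon 1, then a 12-letter intron, then exon 2.
gene = "ATGGCT" + "GTAAGTTTTCAG" + "AAATAA"
exons = [(0, 6), (18, 24)]  # positions of the two exons within the gene

pre_mrna = transcribe(gene)   # the full RNA copy, intron included
mrna = splice(pre_mrna, exons)  # intron removed, exons joined
protein = translate(mrna)
print(mrna, protein)  # AUGGCUAAAUAA MAK
```

The intron never reaches the translation step; only the joined exons determine the amino acid sequence.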
Over time, however, it was realized that life was not so simple, especially for
the eukaryotes. The one gene-one enzyme hypothesis was particularly troubling for
the higher (more complex) eukaryotes. For example, the approximately 20,000-25,000
protein-coding genes in the human genome3
are used to create 100,000-300,000 distinct proteins (the actual number is uncertain).
The low number of genes in the human genome was troubling for several reasons.4 First, this meant that we
did not have that many more genes than organisms much simpler than us. Second, we
needed a way to create many proteins from few genes and nobody knew how this could
be done on such a large scale. And third, the complexity of the genomic computer
program ratcheted up to even more uncomfortable levels for those who thought we
arose through random chance.
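To put the figures quoted above in perspective, a quick calculation shows how many distinct proteins each gene would need to yield on average. The variable names are mine, and the inputs are simply the ranges given in the text, which are themselves uncertain.

```python
# Ranges quoted in the text (the protein count in particular is uncertain).
genes_low, genes_high = 20_000, 25_000
proteins_low, proteins_high = 100_000, 300_000

# Fewest proteins per gene: lowest protein count over highest gene count.
min_ratio = proteins_low / genes_high
# Most proteins per gene: highest protein count over lowest gene count.
max_ratio = proteins_high / genes_low

print(f"Average proteins per gene: {min_ratio:.0f} to {max_ratio:.0f}")
# Average proteins per gene: 4 to 15
```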
Even before the Human Genome Project5
was complete, we knew that some proteins are manufactured through a process called
“alternate splicing”, where exons from different locations in the genome
are combined to create many different proteins. From the ENCODE project,6 we learned that alternate splicing is so pervasive
that the definition of the word “gene” is currently under debate.7 Thus, the one gene-one enzyme
hypothesis turned out to be a gross oversimplification. However, the word and the
concept of a “gene” are so useful that for the rest of this article I
will refer to “genes” in the classic sense: a contiguous stretch
of DNA with a starting and ending location and a set of introns and exons that could
potentially be transcribed, spliced, and translated into a single protein. Each
gene, however, is made of parts that can be recombined with parts from other genes
in different locations in the genome to create proteins not coded by any specific
gene.
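A rough sense of how quickly the combinations multiply can be had from a deliberately simplified model. The sketch below assumes a single hypothetical gene whose first and last exons are always kept while each internal exon may be included or skipped. It does not model exons being combined across genes, and the exon names are placeholders.

```python
from itertools import combinations


def isoforms(exons: list[str]) -> list[tuple[str, ...]]:
    """List exon combinations that keep the first and last exon and
    include any subset of the internal exons, preserving their order."""
    first, *internal, last = exons
    results = []
    for k in range(len(internal) + 1):
        for chosen in combinations(internal, k):
            results.append((first, *chosen, last))
    return results


exon_names = ["E1", "E2", "E3", "E4"]  # placeholder exon labels
for iso in isoforms(exon_names):
    print("-".join(iso))

# Each internal exon is either in or out, so n internal exons give 2**n products.
print(len(isoforms(exon_names)))  # 4
```

Under this simplified model, a gene with ten exons (eight internal) would already allow 256 different combinations.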
Alternate splicing is a brilliant design concept that allows for a streamlined genetic
program that takes up a fraction of the space compared to a program that coded for
each protein independently. But this added complexity comes at a price. It has been
conservatively estimated that each intron adds the same amount of complexity as
approximately 30 additional DNA letters.8
Thus, the “mutation target” for a gene is increased for each intron
added. Consider that the average protein-coding gene has 7-10 introns and that the
total length of introns is often longer than the total length of protein coding
DNA, and one can see why this is a problem. Such a system is costly to maintain,
and its complexity poses a difficulty for naturalistic theories of origins. In fact,
a sizeable proportion of human genetic disease has been attributed to mutations
within intron-exon splice sites.9
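The size of this added mutation target is easy to work out from the two figures given above. This is back-of-the-envelope arithmetic using only the numbers quoted in the text; the variable names are mine.

```python
LETTERS_PER_INTRON = 30            # conservative estimate quoted in the text
introns_low, introns_high = 7, 10  # average introns per protein-coding gene

extra_low = LETTERS_PER_INTRON * introns_low    # 210
extra_high = LETTERS_PER_INTRON * introns_high  # 300
print(f"Added mutation target per gene: about {extra_low} to {extra_high} DNA letters")
```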
Introns are typically included in the junk DNA category, but they have specific
sequences at the head and tail ends that tell the splicing mechanism where to cut,
etc., so they are not without function. (Exons also have splice signals at their
ends. Thus, some of the information for splicing out the introns is found within
the protein-coding portion of the genome. The protein-coding sections code for both
protein sequence and splicing patterns at the same time!)
The ENCODE project made the significant discovery that nearly all of the genome
was turned into RNA at some point in the life of a cell and that multiple overlapping
RNAs were often created from the same stretch of DNA. This was a tremendous blow
to junk DNA theorists.10
However, perhaps more importantly, the ENCODE results also documented an amazing
amount of alternate splicing. So, here we were, knowing that a huge portion of the
genome is active and that the protein-coding portions were being used in complex
combinations, but we still did not know how it all came together. Because of this,
scientists have been looking for a “splicing code” within the genome
that controls the slicing and dicing of the protein genes. This splicing code must
account for 1) the complex combinations of exons needed to create hundreds of thousands
of proteins from tens of thousands of protein genes, 2) the variation in splicing
from cell to cell needed to account for the different proteins expressed in different
cell types, and 3) changes in splicing patterns over time as the organism proceeds
from fertilized egg to adult (since not all genes are active at all stages in the
life cycle). All this information must be coded in the genome, but it also cannot
interfere with the protein-coding domains. Thus, most of this information must reside
within the introns and in the spaces between genes.
A paper recently appeared in Nature in which the authors claimed to have discovered
the beginning of the splicing code. What they found is a marvel of complexity. Science
labs across the world have been generating tremendous amounts of data, and the authors were
able to capitalize on this new knowledge in a massive data mining exercise. Specifically,
vast databases have been compiled that tell us which genes are active in different
cell lines and at different stages of development. We also know of many DNA-binding
factors and their specific sequence targets (usually a short string of very precise
letters that are targeted by proteins with whimsical names like “Star”,
“Nova”, and “Quaking-like”). With this knowledge, they were
able to approach the issue statistically to document significant features that help
to control alternate splicing. They found many “motifs” (short DNA words
of 5-10 letters each) before and after many exons that were strongly associated
with different cell types. In all, they could explain 60% of the alternate splicing
patterns found in the human genome just by the presence or absence of these motifs.
Many of the motifs were known previously and are sites for known DNA-binding proteins.
Many other motifs were new to science.
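The basic idea of recording the presence or absence of motifs around an exon can be sketched as follows. This is not the statistical model used in the paper, only an illustration of what a presence/absence feature looks like. The motif strings and flanking sequences are arbitrary placeholders and are not claimed to be real binding sites.

```python
def motif_features(upstream: str, downstream: str, motifs: list[str]) -> dict[str, bool]:
    """For each motif, record whether it occurs in the sequence just
    before (upstream) or just after (downstream) an exon."""
    features = {}
    for motif in motifs:
        features[f"{motif}_upstream"] = motif in upstream
        features[f"{motif}_downstream"] = motif in downstream
    return features


# Placeholder motifs and flanking sequences, invented for this example.
motifs = ["ACGTAC", "TTGGCA"]
upstream_flank = "GGACGTACTTTCAG"
downstream_flank = "GTAAGTTTGGCAAA"

features = motif_features(upstream_flank, downstream_flank, motifs)
for name, present in features.items():
    print(name, present)
```

A table of such features, collected for many exons across many tissues, is the kind of input from which associations between motifs and splicing patterns could then be sought statistically.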
The median number of tissue-specific motifs associated with splicing, per exon,
ranged from 12 for the central nervous system to 19 for embryo.11 There were additional tissue-independent features
associated with most or all exons and additional and abundant short motifs that
were not considered in the above counts. This means the splicing code is complex
and that complex combinations of instructions are needed to control how the many
exons combine to produce the multitude of proteins found in the human body.
They also discovered features related to splicing much farther away from the protein-coding
regions than they expected. Because of technical limitations, most studies on transcription
regulation have historically focused on a few dozen letters immediately upstream
or downstream of a target sequence. Here, they documented features much farther into
non-coding regions than previously known (up to 300 letters away). Thus, even more
junk DNA has been subsumed into the functional DNA category!
But this is only the beginning. They have only scratched the surface and have already
discovered amazing complexity. They only managed a prediction accuracy of 60%. Therefore,
much remains to be discovered. Where is the missing information? Perhaps it will
be found deeper into the non-coding DNA. Perhaps, because they did not consider
the 3-D architecture of the DNA within the nucleus, additional features may be discovered
much farther away or even on different chromosomes! The possibilities are endless
and we will certainly update you as more is learned.
There is one final implication of this work I would like to discuss. There are many
“pseudogenes” in the genome that look like functional genes but have
“mutations” that prevent them from being turned into proteins. The presence
of pseudogenes has been an enigma since their discovery, but the idea has generally
been used to attack creationists and other advocates of design. I believe the arguments
are spurious12 and we
have written much about them in prior articles.13
Even though functions have been found for many pseudogenes, it is true that, if
transcribed and spliced, a pseudogene cannot be translated into a protein. However,
now that we are aware of alternate splicing, future work may show that many of the
pseudogene exons are incorporated into functional proteins. If so, the entire pseudogene
argument will collapse like a house of cards. But, only time will tell.
For now, let us marvel at the amazingly engineered human genome. God wrote a
genetic computer program that is, to date, unsurpassed by any human technology.
The wisdom and foresight that went into it is nothing short of stunning. He engineered
a string of DNA as long as a person is tall that could withstand thousands of errors
(mutations), adapt to changing environments (through self-modifying code that turns
different genes on and off, depending on conditions), and that can be packed into
a microscopic cell without forming knots! Now we learn that His program is a wonder
of data compression and efficiency. It is more sophisticated than anything we have
ever contemplated.
References
- Putnam, N.H., et al., Sea anemone genome reveals ancestral Eumetazoan gene repertoire and genomic organization, Science 317:86–94, 2007.
- Carter, R.W., The slow, painful death of junk DNA.
- Pennisi, E., Gene counters struggle to get the right answer, Science 301:1040–1041, 2003.
- Claverie, J., Gene number. What if there are only 30,000 human genes? Science 291:1255–1257, 2001.
- International Human Genome Sequencing Consortium, Initial sequence and analysis of the human genome, Nature 409(6822):860–921, 2001.
- ENCODE Project Consortium, Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, Nature 447:799–816, 2007.
- Gerstein, M.B., What is a gene, post-ENCODE? History and updated definition, Genome Research 17:669–681, 2007.
- Lynch, M., Rate, molecular spectrum, and consequences of human mutation, Proceedings of the National Academy of Sciences USA 107(3):961–968, 2010.
- Barash, Y., et al., Deciphering the splicing code, Nature 465:53–59, 2010.
- Williams, A., Astonishing DNA complexity update.
- The use of human embryo data is highly disturbing to me, but this article is not about the ethical, moral, or spiritual ramifications of the “brave” new world of modern science so I will refrain from further comment.
- The Great Dothan Debate.
- For a list of articles on pseudogenes, see the Junk DNA section of the Vestigial Organs Questions and Answers page.