What proportion of the human genome is actually functional?

And how much variation is tolerable?


 Credit: Zephyris, CC BY-SA 3.0, via Wikimedia Commons.

Anyone who’s been paying attention to the science of genomics knows that the question, “How much of the genome is functional?” is hugely contentious. The ‘junk DNA’ hypothesis has been the ruling paradigm for decades, and recent discoveries that highlight the functionality of the non-coding regions of the genome have been slow to win acceptance. It is true that only about 2% of the genome codes for protein. It is also true that a mutation in a protein-coding region is far more likely to have a noticeable effect.

But it is also true that most of the work in the genome occurs at the level of RNA, not protein. The majority of the genome is transcribed into RNA, and that RNA is then used to control a myriad of functions in the cell, only a minority of which involve protein production. Is this because the genome has a lot of parasitic DNA (i.e., functionless DNA that is only retained because it can replicate itself)? That makes no sense from a creation standpoint, of course, but it also makes little sense from an evolutionary standpoint. How could most of the RNA that is being produced do nothing? Worse, would not a massive production of useless RNA be a waste of precious resources? Would not natural selection be constantly honing the genome by removing the parasitic junk?

The answer to the question has been a long time in coming. It was no trivial thing to figure out how much of the genome is functional, and we still don’t know everything. But it is not like we know nothing, and geneticists do have ways of assessing functionality:

  • Classically, evolutionary scientists looked for shared regions among the genomes of distantly related species. They reasoned that if something has remained the same for so long, it must be important. This is called the ‘phylogenetic’ method, and it has been a powerful tool in the hands of the evolutionary community in convincing people that evolution is true. There are limitations to their reasoning, and more will be revealed below, but, briefly, shared DNA elements can often be deleted in model organisms with little to no noticeable effects. If the thing can be deleted, how can someone claim it must be functional?
  • Another way to estimate functionality is to compare disease states to DNA sequence data, a task now often done using machine learning. These genome-wide association studies (GWAS) have been used to map specific mutations to specific diseases. Unsurprisingly, most of the disease-causing mutations have been located in protein-coding regions. Yet, GWAS has often failed to identify a specific genetic factor. We now know that many traits are caused by a combination of genes acting in concert, that many people walk around in a perfectly healthy state even though they carry variants that proved lethal in others, and that many diseases are caused by mutations in the non-coding areas of the genome.
  • Yet another method is to look at different areas in the genome and see which ones deviate from some expected pattern. Granted, the pattern one expects to see is very much dependent on some overarching model, but this at least seems reasonable. Think about it, the genome is composed of only four letters. A and T make up about 60% of the genome and C and G make up the other 40%. When we find areas that diverge from this basic pattern, we can infer a lot (for example, protein coding genes are typically GC rich). Likewise, when we find a stretch of DNA that shows little variation from person to person, we might infer that this area is extremely important and that mutations in that region are more likely to be deleterious to the individual.
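The third approach can be shown in a few lines of code. The sketch below scans a toy DNA string (an invented example, not real genome data, with an illustrative tolerance threshold) for windows whose GC content deviates from the roughly 40% genome-wide background:

```python
# Sketch: flag genome windows whose GC content deviates from the
# ~40% genome-wide background. The sequence and tolerance are
# invented for illustration.
def gc_fraction(seq):
    """Fraction of G/C bases in a DNA string."""
    return sum(base in "GC" for base in seq) / len(seq)

def flag_gc_outliers(genome, window=20, background=0.40, tolerance=0.15):
    """Return (start, gc) for each non-overlapping window whose GC
    content differs from the background by more than `tolerance`."""
    outliers = []
    for start in range(0, len(genome) - window + 1, window):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - background) > tolerance:
            outliers.append((start, round(gc, 2)))
    return outliers

# An AT-rich stretch followed by a GC-rich, gene-like stretch.
toy = "ATATTAATTATATAATATTA" + "GCGGCCGCATGCGCGGCCGC"
print(flag_gc_outliers(toy))  # [(0, 0.0), (20, 0.9)]
```

Both windows are flagged: the first is far below the background (AT-rich), the second far above it (GC-rich, as protein-coding genes tend to be).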

Using this last method to identify functional constraints is similar to a now-famous story from World War II. In 1943, a study was done by the Allies1 on the bullet holes seen in planes that returned to base. There was a very clear pattern, and the initial thought was to fortify those areas that were more likely to have holes. Then someone realized that the opposite was true. The areas with no holes were the most vital areas on the plane. The reason they almost never saw planes with holes in those areas was because very few planes that were hit in those places returned to base. This is called survivorship bias.

 Credit: Martin Grandjean (vector), McGeddon (picture), Cameron Moll (concept).
An illustration of survivorship bias. This theoretical reconstruction is based on an unillustrated 1943 report by Abraham Wald. Pictured is a Lockheed PV-1 Ventura.

Looking for areas of the genome that have less variation than expected can tell us that these areas are more important than others. This cannot, however, tell us that the rest is not important. The specific shape of a tail elevator in a small plane, for example, might not be critical, but the presence of that elevator certainly is. The pilot can manually adjust to many different flying conditions, but without at least the semblance of a functional tail elevator it would be impossible to control the pitch (i.e., pointing up/down) of the plane. In the same way, the cell can adjust to many different genomic conditions. If a critical gene is mutated, the cell will die, but just because some gene can be mutated does not mean it has no function. This is a critical distinction to keep in mind.

Assessing functionality

Until recently, we did not have a lot of data with which to work. The first draft of the human genome was not finished until 2001, and even though genetic testing companies have amassed millions of samples, their information is both private and incomplete. We need full-length, high-quality genomic data if we are going to see which parts of the human genome vary from person to person, and which don’t. After that, we might be able to apply some mutation model to the data and see which places significantly differ from what is expected. There are other challenges, however. For example, protein-coding regions are grossly overrepresented among the known genes under purifying selection,2 and the mutation rate varies across non-coding regions. Thus, we need a lot of good-quality genomes and a sophisticated mutation model to carry out such studies.

This is exactly what the Genome Aggregation Database Consortium has done. In a paper that lists over 200 co-authors, Chen et al. report on a computerized system for identifying functional regions of the genome.3 They call it Gnocchi, and they used it to create a genome-wide map of human constraint. ‘Constraint’ measures how far the observed variation in a region falls below what would be expected if mutations accumulated freely. They made the sensible assumption that areas with significantly fewer variants than expected must be functional.

They started with over 150,000 genomes and selected the ones with the highest quality sequence scores. Ending up with more than 76,000 high-quality human genomes, they then identified 644 million places (out of 3.1 billion letters) in the genome where variation exists, amounting to one variant in about every 5 letters, or one in about every 8 after removing the high-frequency variants.
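As a quick sanity check on that density figure (using the rounded numbers above, so the result is approximate):

```python
# Back-of-envelope check of the variant density quoted above,
# using the rounded figures (3.1 billion letters, 644 million variant sites).
genome_length = 3.1e9
variant_sites = 644e6

letters_per_variant = genome_length / variant_sites
print(f"about one variant every {letters_per_variant:.1f} letters")  # ~4.8, i.e. ~1 in 5
```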

Once they had this map, they set to work identifying the constrained areas by applying a sophisticated mutation model. As they stated:

One key challenge in quantifying non-coding constraint is the estimation of the true base mutation rate, which can be affected by various genomic phenomena, potentially operating at different scales.

Their new mutation model takes into account the trinucleotide context of each position, the location and amount of methylation in different areas of the genome,4 and ‘regional genomic features’. After dividing the genome into many non-overlapping, 100-base-pair ‘bins’, they used this model to predict the level of variation expected under neutrality (i.e., with no selection present) in each bin. They found a consistent signal: the coding regions (even bins that contain but a single coding nucleotide) were more likely to have less variation. These regions were more likely to “diverge from neutrality” and were “more constrained”. They wrote:

As expected, the average constraint for protein-coding sequences is stronger than for non-coding regions.

Within the non-coding areas of the genome, they did find many “constrained” regions. These were enriched for known regulatory elements. In other words, we already knew about many of the important bits within the non-coding DNA. But they also determined that the most constrained regions of the non-coding DNA were more likely to be associated with more constrained protein-coding genes.
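The bin-level logic can be sketched in a few lines. This is a simplified illustration only, assuming a Poisson-style depletion score with the expected counts simply given; the study’s actual mutation model (trinucleotide context, methylation, regional features) is far more elaborate:

```python
import math

# Sketch of the bin-level constraint idea: compare the observed variant
# count in each 100-bp bin with the count expected under neutrality.
# A positive z-score means fewer variants than expected, i.e. constraint.
# (Hypothetical numbers; the real model derives `expected` from a
# sophisticated mutation model rather than taking it as given.)
def constraint_z(observed, expected):
    """Simple depletion z-score: (expected - observed) / sqrt(expected)."""
    return (expected - observed) / math.sqrt(expected)

bins = [
    ("coding-like bin", 4, 20.0),    # strongly depleted -> constrained
    ("neutral-like bin", 19, 20.0),  # close to neutral expectation
]
for name, observed, expected in bins:
    print(f"{name}: z = {constraint_z(observed, expected):.2f}")
```

The first bin scores around z ≈ 3.6 (far fewer variants than expected), the second near zero, which is the kind of contrast the study used to separate constrained regions from neutral ones.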

Nerdy details: Only a small fraction of non-coding bins (3.9%) approached the average score of the coding bins, and the tiniest fraction of non-coding space (0.05%) was found to be as constrained as the most constrained protein-coding areas. They found that cis-regulatory elements were significantly enriched in the most constrained areas. Other significant areas were gene promoters and enhancers, of which 10.2% and 6.3%, respectively, were under as much constraint as the average protein-coding bin. A much higher percentage of strong constraint (22.2%) was found for microRNA genes. Among the long non-coding RNA genes, even though many of them have the form of protein-coding genes, and even though they are actively transcribed and alternately spliced, only 3.7% were strongly constrained.

When they focused on genetic variants that are known to have an effect (e.g., disease-causing mutations), they identified 13,000 known variant–trait pairs and an additional 164 that were previously unknown.

Failure of phylogeny

Comparing their work to phylogeny-based approaches, they found that phylogenetic conservation scores were only weakly correlated with their constraint measure. This is an incredible result! This means that shared similarities have nothing to do with evolution and all those decades of speculation amounted to nothing more than hot air.

Low sequence specificity or non-functionality?

How functional is the wall of your home? In one sense, very. It keeps out the heat in summer and cold in winter. It holds up the roof. It prevents neighbours from seeing you while you sleep. OK, but is the thickness of the wall important? What about the colour? How about the number of windows? A structural feature can be highly adaptable and still perform its main function(s).

The same is true of the different designed features of our genomes. There are certain areas where variation is simply not tolerated. There are other areas where variation can abound. There are stretches of DNA that must not be touched and other areas with the sole purpose of providing scaffolding for the important bits. Scaffolding? Let me explain. The genome is a four-dimensional entity. The linear strings of DNA (chromosomes) code for a vast two-dimensional interaction network, but they also fold into three-dimensional shapes that change over the fourth dimension, time. Yet, since different genes are designed to work together, they tend to be found together, not along the same chromosome, but in 3-D space after the chromosomes are folded into position. Thus, some huge fraction of the genome is there only to hold genes in place. Many of these areas are free to mutate, but they are certainly important.

There are short sections of the genome that tend to twist the wrong way. Instead of the classic right-handed DNA double helix, these places can form an irregular left-handed helix called Z-DNA. Recent analyses have shown that cancer-causing mutations are often found in Z-DNA, that Z-DNA is often found just upstream of protein-coding genes, that it is involved in the innate immune system, and that we produce at least one protein (ZBP1) that is designed to bind to DNA in the Z conformation.5 Just under 1% of the genome can take on this configuration. These areas may not be able to make protein, and the exact sequence might not be critical (except for those letters where changing them might cause cancer), but they are still highly functional.

Consider also the function of the abundant non-coding RNAs that are produced by our cells. Since they do not code for protein, this specific constraint is removed. Yet, even though the sequence of the RNA might not be highly constrained, they do so many things in the cell that it would be hard to list them all.6 Some RNAs fold into a structure that is used, for example, in the ribosome (the apparatus that makes proteins from messenger RNA, i.e., translation). Some serve as the bridge between mRNA and protein (transfer RNAs). Some are important for chromatin7 remodelling. Some serve as gene regulators (enhancers and repressors of transcription, i.e., the process of making RNA from DNA).

One of the most important roles for non-coding RNA is to regulate protein synthesis. They can regulate transcription by sticking to the DNA at a complementary sequence (for example, the functional counterpart of a pseudogene). This prevents that region of DNA from being transcribed into messenger RNA, and thus the protein cannot be made. They can also regulate translation, either by sticking to a complementary RNA (and thus preventing it from being translated into a protein) or by allowing the protein translation apparatus to attach to them (thereby competing for the translation machinery and slowing the rate of translation of the complementary RNA). In each of these cases, the exact sequence is not very important: even if several bases mismatch between the RNA and the target DNA or RNA, the two will still align and stick, provided enough bases match over a long enough stretch. Thus, here we have a case of profoundly important functionality but with flexibility in the code. The Chen et al. study was not designed to detect such things.
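That tolerance for mismatches can be illustrated with a small sketch. The sequences and the 80% threshold below are invented for illustration; real binding depends on thermodynamics, not a simple match count:

```python
# Sketch of how a regulatory RNA can still "stick" to a target despite
# mismatches: count Watson-Crick complementary positions between an
# antisense RNA and its target (read antiparallel), and accept the
# pairing if enough positions match. Sequences and threshold are
# invented for illustration.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def pairs_despite_mismatches(regulator, target, min_match_fraction=0.8):
    """True if enough antiparallel positions are complementary."""
    matches = sum(
        COMPLEMENT.get(a) == b
        for a, b in zip(regulator, reversed(target))
    )
    return matches / len(regulator) >= min_match_fraction

target    = "AUGGCUACGUAGCUAGGCAU"
regulator = "AUGCCUAGCUACGUAGCCAU"  # perfect antisense copy of target
mutated   = "AUGCCUAGCAACGAAGCCAU"  # same regulator with two mismatches
print(pairs_despite_mismatches(regulator, target))  # True
print(pairs_despite_mismatches(mutated, target))    # True: still sticks
```

With two mismatches the pairing still holds (18 of 20 positions match); only once the mismatches pile up does the regulator fall below the threshold and fail to stick.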

What is the answer?

In the end, this new study, amazing as it is, was only able to highlight those areas of the genome that are not free to mutate. This is but one measure of functionality. The study could not tell us how much of the genome is functional, only how much of the sequence is tightly constrained. It could also not tell us how much mutation can accrue before the genome stops working. In the end, we still don’t know how much of the genome is functional.

WWII warplanes could withstand heaps of abuse, with parts being shot away and other parts being riddled with holes, and still be intact enough to fly home. Why? Because they were well designed. In the same way, the human genome was designed to withstand heaps of abuse. Humans have picked up millions of mutations (thousands per individual, many millions across the population) over the past several thousand years, yet we still live. There is redundancy built right into the heart of the system (e.g., we have two copies of the genome) and very often the cell can avoid a broken system by taking an alternate pathway.

We are much more complicated than any human-made machine. Our design testifies to the brilliance of our Creator. The design of the human, or indeed any, genome also argues strongly against evolution, for how could a blind system of accidental errors ever produce a highly complex system that was both robust to error and adaptable? Instead of being created from the bottom up, life was created from the top down. God is the creator of life. He designed all those parts, brought them together, filled the cell with energy, and then ‘let go’. And life has been humming along ever since.

Published: 14 March 2024

References and notes

  1. The Allies were a coalition of nations (principally the UK, the Soviet Union [Russia + satellite nations], China, and the US) who were fighting the Axis (principally Germany, Italy, and Japan) during World War II, which lasted from 1939 to 1945. Return to text.
  2. Claims of ‘selection’ are often couched in evolutionary assumptions, including deep time. Yet there are clear cases where survivorship is linked to variations in some important gene. This is not a problem from a creation perspective. But there are two forms of selection: purifying and positive. Purifying selection, the removal of bad alleles, is the focus of this article. Positive selection, the amplification of new mutations, is more rare, more contentious, and more important to Darwinism. Return to text.
  3. Chen, S., et al., A genomic mutational constraint map using variation in 76,156 human genomes, Nature 625:92–100, 2023. Return to text.
  4. Methylation is a type of epigenetic marker but does not alter the nucleotide sequence itself. See Carter, R.W., Darwin’s Lamarckism vindicated?, 1 Mar 2011. Return to text.
  5. Brazil, R., More than a mirror-image: left-handed nucleic acids, chemistryworld.com, 5 Feb 2024. Return to text.
  6. E.g., see Kaikkonen, M.U., Lam, M.T.Y., and Glass, C.K., Non-coding RNAs as regulators of gene expression and epigenetics, Cardiovasc. Res. 90(3):430–440, 2011. Return to text.
  7. In the nucleus, DNA is wrapped around histone protein complexes. The combination of DNA + protein is called chromatin. Return to text.