
Appendix 2—Search for nt30 oligomer, bootstrapping from the beginning of the PR.C sequence

Statistical artefacts can arise when interpreting sequence data, a common problem when data are selectively chosen because they seem to support a favoured evolutionary scenario. A recent common ancestor has been claimed for whales and hippopotamuses,1 and evolutionary relationships between human, chimpanzee and gorilla are claimed to be settled on the basis of comparisons of protein-coding DNA sequences, in spite of the fact that “In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other.”2

In an online essay “Plagiarized Errors and Molecular Genetics”, Dr. Edward E. Max argued that the chance of the same gene-deactivating mutation occurring independently in different organisms is remote, so that shared mutations most likely reveal a common ancestor.3 Misinterpretation can arise when mutational hotspots and the presence of overlapping codes are not taken into account. Furthermore, instead of common ancestry, a common designer is a better explanation, revealed through reuse of similar principles to solve similar challenges.

Too little attention has been paid to how coincidence can mislead researchers when interpreting DNA and protein sequence data. ‘Coincidence’ here refers to nucleotide or protein sequences having patterns not related to the reason being claimed. This potential for error was recently illustrated by Truman and Borger.4

Susumu Ohno is one of the best-known proponents of evolution, so when he wrote,

“All in all, there remains little doubt that the entire base sequence shown in Fig.1 a and b originally arose from repeats of the G+C-rich base oligomer, such as the decamer C-G-A-C-G-C-C-G-C-T”

we decided to investigate how solid this evidence was. It was possible that such a chain of decamers could be an ingenious solution to a particular functional requirement, and the gene designed as such.5 On the other hand, the proposal that a vast number of unrelated genes all evolved from just a few short oligomer repeats over millions of years is not consistent with a creation science model.

Ohno believed that the ancestral PR.C gene consisted of repeats of the decamer CGACGCCGCT. Three copies (30 nt, or ten codons) would have encoded a decapeptide repeat with the periodicity Arg-Arg-Arg-Ser-Thr-Pro-Leu-Asp-Ala-Ala. In the years since, many bioinformatics tools have become available which permit us to see how plausible this is.
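The relationship between the decamer and the decapeptide can be checked by hand. The following is a minimal sketch in Python; the names `CODONS`, `DECAMER` and `translate` are illustrative and not taken from Ohno's paper, and the codon table holds only the standard-code entries needed for this one sequence.

```python
# Standard genetic code, restricted to the codons that occur in the 30-nt repeat.
CODONS = {
    "CGA": "Arg", "CGC": "Arg", "TCG": "Ser", "ACG": "Thr", "CCG": "Pro",
    "CTC": "Leu", "GAC": "Asp", "GCC": "Ala", "GCT": "Ala",
}

DECAMER = "CGACGCCGCT"  # the 10-nt unit quoted from Ohno


def translate(nt):
    """Translate a nucleotide string codon by codon, starting at position 0."""
    usable = len(nt) - len(nt) % 3  # ignore an incomplete trailing codon
    return [CODONS[nt[i:i + 3]] for i in range(0, usable, 3)]


nt30 = DECAMER * 3          # three decamers = 30 nt = ten codons
peptide = translate(nt30)
print(nt30)
print("-".join(peptide))    # Arg-Arg-Arg-Ser-Thr-Pro-Leu-Asp-Ala-Ala
```

Because 10 is not a multiple of 3, a single decamer does not return to the same reading frame; only after three copies does the frame repeat, which is why the 30-nt unit is the one used in the searches.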

Searching for a linked series of 30nt oligomers in PR.C

All alignments reported here use the Needleman–Wunsch algorithm for optimal pair-wise alignment with default settings for gap opening and gap extension.6
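For readers unfamiliar with global alignment, the idea behind Needleman–Wunsch can be sketched as follows. This is a simplified illustration, not the EMBOSS Needle implementation: it assumes a single linear gap penalty instead of separate gap-opening and gap-extension penalties, and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen only for the example.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment of a and b with a linear gap penalty (simplified sketch).

    Returns (score, aligned_a, aligned_b, identities).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    aligned_a = "".join(reversed(top))
    aligned_b = "".join(reversed(bottom))
    identities = sum(x == y for x, y in zip(aligned_a, aligned_b) if x != "-")
    return score[n][m], aligned_a, aligned_b, identities


print(needleman_wunsch("ACGT", "AGT"))
```

The `identities` value corresponds to the "number of aligned nt" reported in the tables. Because the algorithm always returns the best-scoring arrangement of gaps, some number of identities will be found between any two sequences of similar composition.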

The nucleotide sequence for PR.C is available from the NCBI database under Accession number X00046.1. The nt positions reported in the tables in this paper match the sequence positions published by Ohno7 to facilitate the discussion.

Initially, before any insertions and deletions had occurred, the ancestral PR.C would have consisted of many copies of the 30nt oligomer CGACGCCGCTCGACGCCGCTCGACGCCGCT: about 43 copies (1286/30) if the putative ancestral gene were as long as today’s. The evidence should be easy to find.

If Ohno is right, we would expect to find a persuasive location for the first nt30 within PR.C. Thereafter the next one within PR.C should be identifiable. We can examine various sliding windows along PR.C. What length should we use? If much shorter than 30 nt positions, we might overlook possible indels, but if too long, we increase the chance of finding a candidate by pure chance. To illustrate the effect of window size, the 30nt sequence was aligned to sections ranging from 30 to 50 nt, beginning at the start codon.8 Table 1 shows how the number of aligned positions increases with window size used, but since the algorithm uses gap penalties, this effect levels off after a certain length (figure 1).

table1
Table 1. Number of nucleotides which align with Ohno’s 30nt oligomer (highlighted in red) as a function of block size within PR.C, beginning at the start position. Positions 30 to 33 are shown here. Results for block sizes ranging from 30–50 nt are available online.7,9 Alignments were made using EMBOSS Needle.6 PR.C from NCBI Accession Nr. X00046.1.10
figure1
Figure 1. Number of aligned nt30 nucleotides as a function of sequence length from the beginning of PR.C. Aligning nt30 with longer portions of PR.C creates more opportunities to find alignments by chance, since indels are optimally inserted, but the effect levels off since the algorithm adds scoring penalties for gaps inserted.

Search for the first nt30 location

To find the first putative 30nt oligomer within PR.C, the best overlap within PR.C was searched for systematically, using sliding windows of length 40-nt, between nt positions 1–79.11 The overview in table 2 shows the results for nt positions 1–65. Sometimes identical alignments were found for different portions of PR.C examined. The locations of the aligned regions from table 2 are displayed graphically in Figure 2.
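The sliding-window scheme described above can be sketched as follows. The function name `sliding_windows` and the placeholder sequence are illustrative assumptions; the real PR.C sequence (Accession X00046.1) would be substituted for `placeholder`.

```python
def sliding_windows(seq, width=40, first=1, last=79):
    """Yield (start, end, subsequence) for every window of the given width
    that fits between positions first and last (1-based, inclusive)."""
    for start in range(first, last - width + 2):
        end = start + width - 1
        yield start, end, seq[start - 1:end]


# Placeholder only: NOT the PR.C sequence, just a string long enough to slice.
placeholder = "ACGT" * 20

windows = list(sliding_windows(placeholder))
print(len(windows))                  # number of 40-nt windows within 1-79
print(windows[0][:2], windows[-1][:2])
```

With a width of 40 and a range of 1–79, the windows start at positions 1 through 40, which gives the forty windows referred to in the caption of table 2. Each window would then be aligned against the 30nt oligomer.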

These data reveal many regions where Ohno’s 30nt oligomer could be aligned, all quite well (which is also true for the rest of PR.C).

table2

Table 2. Best overlap of nt30 with regions from the beginning of PR.C.13 Alignment using EMBOSS Needle.10 PR.C from NCBI,12 Accession Nr. X00046.1. See data in ref. 13: ID refers to the sliding window. Multiple IDs indicate the same alignment was found for those sliding window regions. ‘Location in PR.C’ documents the location of optimal overlap before indels were inserted. The data for all forty sliding windows in the Excel file led to the following results: range = 15–23 perfect nucleotide alignments; average = 19.9; σ = 1.67. PR.C positions used including indels: range = 30–47; average = 38.8; σ = 3.7.

Location-within-nt-1-65
Figure 2. Location within nt 1–65 on PR.C of overlaps with Ohno’s 30nt oligomer, using data from table 2 (see ref. 14).

The systematic analysis failed to provide an unambiguous location for the first nt30 within the putative PR.C ancestor. Based on the greatest number of nt overlaps, one might select example E in table 2 as the best fit, with 23 nt overlaps, but the fit relied on five separate indel regions comprising thirteen nt, which is absurd.

Extensive further experimentation throughout the entire PR.C sequence did not show a pattern of linked 30nt oligomers. Instead, we observe that reasonable matches can be found in many regions throughout PR.C, and the candidates are usually embedded within other possible good alignments (data not shown).

Ohno’s 30nt oligomer was not proposed based on some evolutionary or chemical theory, but was simply the best oligomer he could find, having the best overlap with the PR.C sequence. It represents one out of 4^30 ≈ 10^18 alternatives, and many other possibilities in addition to the triple decamer, CGACGCCGCT, i.e. oligomers of other lengths, were surely also considered.

This is a vast number of candidate sequences which could be proposed for the ancestral PR.C. Since the algorithm will insert the optimal number of indels in the best locations to facilitate alignment, there is a high likelihood that some candidate will align well purely by chance.
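The number of possible 30-nt sequences (four bases at each of thirty positions) can be checked directly. This is plain arithmetic; the variable name `n_candidates` is only illustrative.

```python
import math

n_candidates = 4 ** 30               # four possible bases at each of 30 positions
print(n_candidates)                  # exact integer, equal to 2**60
print(f"{n_candidates:.2e}")         # same value in scientific notation
print(round(math.log10(n_candidates), 2))  # order of magnitude, about 18
```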

Test against null hypothesis: sequences nt30 and PR.C are unrelated

From table 2, the best alignments between nt30 and sliding windows of length 40 nt within region 1–79 of the PR.C sequence had an average of 19.9 aligned nt with σ = 1.67. An average of 38.8 nt positions (including all indels) were required.

We compared this with thirty random 40-nt sequences which had the same proportions of the nucleotides A, C, G, and T as found in PR.C,15 and these were aligned with 30nt. (The random sequences generated were confirmed to reflect the nt distribution of PR.C well.)16 An average of 18.3 aligned nt with σ = 2.20 was found (requiring an average of 41.5 nt positions, due to more indels used). Table 3 shows the alignments having 19 or more aligned nt.
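The random sequences for this null test were generated in Excel (see ref. 15). An equivalent procedure can be sketched in Python as below. The A, C and G proportions are those listed in ref. 15; the T proportion is taken as the remainder, which is an assumption of this sketch, as are the names `FREQ` and `random_sequence` and the fixed seed.

```python
import random

# Nucleotide proportions reported for PR.C in ref. 15.
FREQ = {"A": 0.14996, "C": 0.37141, "G": 0.33023}
# Assumption: T takes whatever proportion remains (not stated explicitly).
FREQ["T"] = 1.0 - sum(FREQ.values())


def random_sequence(length, rng):
    """Draw each position independently according to the FREQ proportions."""
    bases = list(FREQ)
    weights = [FREQ[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
seqs = [random_sequence(40, rng) for _ in range(30)]  # thirty 40-nt sequences
for s in seqs[:3]:
    print(s)
```

Each position is drawn independently, so these sequences match PR.C only in overall base composition, not in the frequency of particular dinucleotides such as CG.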

table3
Table 3. Randomly generated 40-nt sequences having the same proportion of nt as PR.C, aligned with 30nt.17,18 The data for the thirty random sequences led to the following results: range = 13–23 aligned nt; average = 18.3; σ = 2.20. Positions used (including indels): range = 30–51; average = 41.5; σ = 5.42.

The two experiments demonstrate that about twenty nt will overlap with nt30 by chance alone (given a comparable nucleotide distribution), but the order of the nucleotides within nt30 had apparently been further optimized to match details in PR.C (for example, accounting for more CG pairs than expected by chance).

Null hypothesis

Based on the number of nt overlaps, one might select Rand_27_(1–40) from table 3 as the best fit, having 23 perfect overlaps and relying on six separate gaps comprising nine nt. Note that this random sequence aligns within the beginning of PR.C at least as well as the best nt30 candidate (E in table 2: 23 perfect overlaps, five separate gaps comprising 13 nt).

However, the total number of nt overlaps alone is not the best statistical criterion in bioinformatics work. The most plausible alignment within the beginning region of PR.C is not with 30nt, but with the randomly generated ‘Rand_11_(1–40)’ in table 3 (22/30 identity using six nt of indels):

six-nt-indels

Intuition and statistical coincidence can deceive. This would have seemed like an obvious candidate for homology.

As another example of the risk of being misled by statistical artefacts, see example Rand_15_(1–40) from table 3. If we assume the gene underwent a six-nt deletion (a multiple of three, which thus would not create a frame-shift), the result has 19/24 nt identity with only one indel, including a perfect 8-nt matching stretch:

8-nt-match-stretch

The impression of homology is overwhelming. Note that the proportion of A, C, T, and G in the ‘random’ sequences was not optimized for the first 79 nt positions, but reflects the average over PR.C. And thirty is a minuscule proportion of the possible random sequences we could have tested.

Conclusion

Ohno proposed that PR.C, and thereby RIIA, originated from a series of linked 30nt oligomers, and identified the best candidate he could find for this 30nt sequence. However, the null test revealed that this oligomer aligns no better than a handful of 40-nt random sections (as long as Ohno’s and the random sequences resemble the proportion of A, C, G, and T found in PR.C).

This particular null test was limited to the first 79 nt positions of PR.C, but additional checks could be performed over the entire length of PR.C to evaluate the proposal that PR.C arose from multiple 30nt oligomers. The sequences for this null hypothesis could take into account the presence of many GC pairs (see Appendix 3).

References and notes

  1. Ursing, B.M. and Arnason, U., Analyses of mitochondrial genomes strongly support a hippopotamus–whale clade, Proc. Biol. Sci. 265(1412):2251–2255, 1998, doi: 10.1098/rspb.1998.0567.
  2. Scally, A. et al., Insights into hominid evolution from the gorilla genome sequence, Nature 483:169–175, 2012, doi: 10.1038/nature10842.
  3. www.talkorigins.org/faqs/molgen/.
  4. Truman, R. and Borger, P., Why the shared mutations in the Hominidae exon X GULO pseudogene are not evidence for common descent, J. Creation 21(3):118–127, 2007.
  5. To illustrate, a wide variety of repetitive residues leading to fairly random polypeptides can hinder freezing of blood; see Davies, P.L. and Hew, C.L., Biochemistry of fish antifreeze proteins, The FASEB J. 4(8):2460–2468, 1990, www.fasebj.org/content/4/8/2460.short.
    Ice-structuring properties are not exclusive to long organic molecules like proteins, glycoproteins or polysaccharides; see Deville, S. et al., Ice Shaping Properties, Similar to That of Antifreeze Proteins, of a Zirconium Acetate Complex, PLOS ONE, 2011, www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0026474.
  6. www.ebi.ac.uk/Tools/psa/emboss_needle/nucleotide.html.
  7. Ohno, S., Birth of a unique enzyme from an alternative reading frame of the pre-existed, internally repetitious coding sequence, Proc. Natl. Acad. Sci. USA 81:2421–2425, 1984.
  8. See nt30 within lengths of PR.C.
  9. See nt30 within lengths of PR.C. The number of positions aligned are summarized in sheet ‘Fig. nt30 within length of PR.C’.
  10. blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome.
  11. See Best first Oligomer.
  12. blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome.
  13. See Best first Oligomer Overview.
  14. See Best first Oligomer Location.
  15. See Create random nt sequences. A: 0.14996; C: 0.37141; G: 0.33023. Random numbers between 0–1 were generated with Excel and used to generate random nt with the function,
    =IF(AND(Cn>0,Cn<=0.14996),"A",IF(AND(Cn>0.14996,Cn<=0.52137),"C",IF(AND(Cn>0.52137,Cn<=0.8516),"G",IF(AND(Cn>0.8516,Cn<=1),"T","X")))) where n is the row number.
    Using the first thirty sequences over nt positions 1–1,287 resulted in average values of: A = 0.1505, σ = 0.0080; C = 0.3724, σ = 0.0136; G = 0.3284, σ = 0.0128; T = 0.1488, σ = 0.0070, which match well the proportions found in PR.C.
  16. See Create random nt sequences.
  17. See Rand 40-nt aligned w. nt30.
  18. See Best Rand 40-nt w. nt30.