Computational genomics of photosynthetic organisms

Molecular Evolution and Phylogenetics

“Phylogenetic reconstruction is a fast-growing field that is enriched by different statistical approaches and by findings and applications in a broad range of biological areas. Fundamental to these are the mathematical models used to describe the patterns of DNA base substitution and amino acid replacement. These may become some of the basic models for comparative genome research.”

—Pietro Liò and Nick Goldman [LIO1998].

[LIO1998]Liò, P. and Goldman, N. (1998) Models of Molecular Evolution and Phylogeny. Genome Res. 1998 8: 1233-1244. doi:10.1101/gr.8.12.1233


Comparing genome sequences from different organisms, whether of the same species or of different species, shows that genomes are not static but change, or mutate, over their evolutionary history. Mutations can arise from a variety of causes, such as errors in DNA replication or external factors like UV radiation. They may be neutral, deleterious or advantageous for the adaptation, survival and reproduction of the organism concerned. If a mutation occurs in the germline of an organism, it can be transmitted to its descendants. Thus a mutation that is neutral or advantageous for reproduction can spread through a population, giving rise to polymorphisms that may eventually become fixed. A polymorphism is the existence of different variants of a DNA sequence, called alleles.

The most frequent polymorphisms are single nucleotide polymorphisms (SNPs), in which alleles differ by the substitution of a single nucleotide. Short tandem repeats (STRs) are the second most frequent: short DNA sequences that are repeated a different number of times in different alleles. Finally, indels, transpositions, inversions and duplications appear as polymorphisms more rarely.

Molecular Phylogeny

Phylogeny is the branch of biology concerned with inferring the evolutionary relationships between existing species. Traditionally, it was based on morphological and physiological characters. Molecular phylogeny arose in the early 1980s, when the first gene and protein sequences became available. This discipline compares biological sequences to build a hierarchical classification of existing species.

The main objectives of molecular phylogeny are:

  • Determining a hierarchical relationship between existing species according to their evolutionary relationship, so that organisms sharing a recent common ancestor are grouped more closely than those whose common ancestor is more distant.
  • Estimating the divergence time between species, i.e., the time at which their most recent common ancestor existed.

Phylogenetic Analysis

Phylogenetic studies determine the hierarchical relationships between species, and estimate their divergence times, from biological sequences such as proteins or DNA. They are typically divided into five distinct phases:

  • Phase 1: Selecting the biological sequences to be analyzed.
  • Phase 2: Building a multiple alignment of the selected biological sequences.
  • Phase 3: Selecting the substitution models, statistical models of molecular evolution, for the corresponding sequences.
  • Phase 4: Building the phylogenetic trees based on the corresponding multiple alignments and substitution models.
  • Phase 5: Statistical evaluation of the phylogenetic trees.

These phases are not organized in a strictly linear fashion. It is common to go back to earlier phases and revise decisions before advancing to the next phase. It is also common to complete the study for several possible choices in each phase and to compare the results before making a final decision. For example, several multiple alignments with different parameters are normally built in phase 2, different substitution models are explored in phase 3, and phylogenetic trees are built with several different methods in phase 4. The different results are evaluated independently in phase 5 and then compared, in the light of the existing literature and the available biological information, to decide which analysis is the most informative.

Phylogenetic Analysis

Selecting the biological sequences

Phylogenetic studies are based on the comparison of homologous biological sequences. Studying the differences between them makes it possible to estimate the evolutionary relationships between the corresponding species and their divergence times.

To obtain homologous sequences, databases such as RefSeq, UniProt or HomoloGene can be very useful. Another way is to take them from the alignments returned by a BLAST [1] search.

Multiple alignment of the biological sequences

The multiple alignment of the analyzed biological sequences is the most important step in phylogenetic studies, as it is the step in which the different sequences are actually compared.

The following points need to be checked:

  • Remove non-homologous sequences, i.e., those that fail to align. Pairwise alignments should be computed and their significance assessed.
  • If the alignment is poor but the sequences are known to be homologous, the gap opening and gap extension penalties should be adjusted.
  • Often the complete sequence is not known for every organism, so the alignment contains many gaps. The columns that contain gaps need to be removed.
  • Bear in mind the duality between the multiple alignment and the phylogenetic tree: each depends on the other.
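
The gap-removal step in the third point can be sketched in Python as follows. This is only a minimal illustration, not the procedure that any particular tool uses internally: the function name `drop_gap_columns`, the dictionary representation of the alignment and the toy sequences are all assumptions made for the example.

```python
def drop_gap_columns(alignment, gap_chars="-?"):
    """Return a copy of the alignment without the columns that contain a gap.

    `alignment` maps a sequence name to its aligned sequence (hypothetical
    representation chosen for this sketch).
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*rows) iterates over the alignment column by column.
    kept = [col for col in zip(*rows) if not any(c in gap_chars for c in col)]
    # Transpose the kept columns back into one string per sequence.
    cleaned = ["".join(chars) for chars in zip(*kept)] if kept else [""] * len(rows)
    return dict(zip(names, cleaned))


# Toy alignment (made-up data): columns 2 and 4 contain a gap.
toy = {"seq1": "ATG-CA", "seq2": "ATGGCA", "seq3": "A-GGCA"}
cleaned = drop_gap_columns(toy)
print(cleaned)
```

Removing every column with a gap is the strictest option; it discards information when only one sequence is incomplete, which is why the number of remaining positions should always be reported.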

There are several tools for multiple sequence alignment; MEGA [2] is one of them. It can align multiple sequences with either the MUSCLE [3] or the ClustalW [4] method.

Selecting the substitution models

After completing the multiple alignment, it is possible to estimate the genetic distance between the sequences. The genetic distance between two homologous sequences is defined as the number of substitutions accumulated between them since they diverged from a common ancestor. Estimating it is not trivial, because not all substitutions are observable, especially in sequences that have accumulated many substitutions.

Figure 6.1 illustrates why: some of the substitutions that actually occurred cannot be observed, for example when the same site changes more than once.


Figure 6.1: Observed substitutions versus actual substitutions
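
The quantity that can be measured directly from an alignment is the proportion of sites at which two sequences differ, often called the p-distance. The sketch below computes it in Python; the function name `p_distance`, the treatment of gaps and the example sequences are assumptions made for illustration. Because hidden substitutions are not counted, this value is a lower bound on the actual number of substitutions per site.

```python
def p_distance(seq_a, seq_b, gap_chars="-?"):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue  # sites with a gap are ignored (an assumption of this sketch)
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Made-up sequences differing at 2 of 8 sites.
p = p_distance("ACGTACGT", "ACGTTCGA")
print(p)
```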

Phylogenetic analysis relies on choosing an appropriate substitution model. There are several substitution models; the most common are:

  • JC69 model (Jukes and Cantor, 1969) [JUKESCANTOR1969]: The Jukes-Cantor model is the simplest model that proposes a correction of the number of observed substitutions. It assumes that the probability of one nucleotide mutating into another is independent of both the position of the nucleotide and the nucleotide itself:

    • The probability of A changing to each of C, G or T is the same, alpha/3.
    • The same holds for C, G and T.

    Figure 6.2: Jukes and Cantor’s rate matrix.

  • K80 model (Kimura, 1980) [KIMURA1980]: Kimura proposed a refinement of the Jukes-Cantor model that takes into account that transitions are observed more often than transversions. It therefore depends on two parameters:

    • The probability of observing a transition, alpha.
    • The probability of observing a transversion, beta.

    Figure 6.3: Kimura’s rate matrix.

  • T92 model (Tamura 1992) [TAMURA1992]: T92 is a simple mathematical method developed to estimate the number of nucleotide substitutions per site between two DNA sequences, by extending Kimura’s (1980) two-parameter method to the case where a G+C-content bias exists. This method will be useful when there are strong transition-transversion and G+C-content biases.

  • TN93 model (Tamura and Nei 1993) [TAMURANEI1993]: The TN93 model distinguishes between the two different types of transition, i.e., (A <-> G) is allowed to have a different rate from (C <-> T). Transversions are all assumed to occur at the same rate, but that rate is allowed to differ from both of the transition rates.
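
As a worked illustration of how the two simplest models correct the observed differences, the sketch below implements the standard distance formulas for JC69, d = -(3/4) ln(1 - 4p/3), and K80, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)). Here p is the observed proportion of differing sites, and P and Q are the observed proportions of transitions and transversions. The function names and the example values are assumptions made for this sketch.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance from the observed proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("JC69 is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k80_distance(P, Q):
    """Kimura two-parameter distance.

    P: observed proportion of transitions, Q: observed proportion of
    transversions.
    """
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0 or b <= 0:
        raise ValueError("distance undefined: sequences are too divergent")
    return -0.5 * math.log(a * math.sqrt(b))


# Hypothetical values: 25% of sites differ.
d_jc = jc69_distance(0.25)
# Same 25%, but split into 20% transitions and 5% transversions.
d_k80 = k80_distance(0.20, 0.05)
print(round(d_jc, 4), round(d_k80, 4))
```

In both cases the corrected distance is larger than the observed proportion, which reflects the substitutions that cannot be observed directly. When transitions make up exactly one third of the differences, the two formulas give the same value.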

Building phylogenetic trees

The main objects of study in molecular phylogeny, namely the hierarchical relationships between species and their estimated divergence times, are represented by phylogenetic trees.


Figure 6.4: Parts of a phylogenetic tree

There are two types of tree, depending on whether they contain a distinguished node called the root. Rooted trees have a root node, which corresponds to the common ancestor of all the taxa, so a temporal ordering can be established in them. Unrooted trees lack a root node, and for that reason no temporal ordering can be established.

There are two main methods to determine the root of an unrooted tree:

  • Add an outgroup, a taxon known to be the most distant from the rest, and place the root at the midpoint of the branch that joins it to the clade formed by the remaining taxa.
  • Determine the longest branch and set the root at its midpoint.

Tree-Building Methods

The most popular and frequently used tree-building methods can be classified into two major categories: phenetic methods based on distances and cladistic methods based on characters. The former measure the pairwise distance or dissimilarity between genes, whose actual value depends on the definition of distance used, and construct the tree entirely from the resulting distance matrix. The latter evaluate all possible trees and seek the one that best explains the evolution of the characters.

Distance-Based Methods:

The most popular distance-based methods are the unweighted pair group method with arithmetic mean (UPGMA), neighbor joining (NJ) and those that optimize the additivity of a distance tree (FM and ME).

  • UPGMA Method: This method follows a clustering procedure:

    1. Assume that initially each species is a cluster on its own.
    2. Join the two closest clusters and recalculate the distance from the joined pair to every other cluster by taking the average.
    3. Repeat this process until all species are connected in a single cluster.

Strictly speaking, this algorithm is phenetic and does not aim to reflect evolutionary descent. It gives equal weight to every distance and assumes a constant molecular clock, i.e., the same rate of evolution along all lineages. WPGMA is a similar algorithm that weights the distances differently.

The UPGMA method is simple and fast, and has been used extensively in the literature. However, it performs poorly in most cases where the above assumptions are not met.
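
The three UPGMA steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a production implementation: the function name `upgma`, the use of nested tuples to represent clusters, the `frozenset` keys of the distance table and the toy distances are all choices made for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa with UPGMA.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the list of merges as (cluster, height) pairs, in order.
    """
    sizes = {name: 1 for name in names}  # step 1: every taxon is a cluster
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Step 2: find and join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2.0
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        new = (a, b)
        # Distance to the new cluster: average over all member taxa,
        # i.e. the size-weighted mean of the two old distances.
        for other in sizes:
            d[frozenset((new, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[new] = size_a + size_b
        merges.append((new, height))
    # Step 3: the loop ends when a single cluster remains.
    return merges


# Made-up ultrametric distances between four taxa.
raw = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
       ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(pair): value for pair, value in raw.items()}
merges = upgma(["A", "B", "C", "D"], dist)
for cluster, height in merges:
    print(cluster, height)
```

Each merge height is half the distance between the joined clusters, which is where the constant-rate assumption enters: both members of a pair are placed at the same distance from their common ancestor.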

  • Neighbor Joining Method (NJ): This algorithm does not assume a molecular clock and adjusts for rate variation among branches. It begins with an unresolved star-like tree. Each pair of taxa is evaluated for joining, and the sum of all branch lengths of the resulting tree is calculated. The pair that yields the smallest sum is considered to be the closest neighbors and is joined. A new branch is inserted between the joined pair and the rest of the tree, and the branch lengths are recalculated. This process is repeated until only one terminal is present.
The NJ method is comparatively fast and generally gives better results than UPGMA. However, it produces only one tree and neglects other possible trees that might be as good as the NJ tree, if not significantly better. Moreover, since errors in distance estimates are exponentially larger for longer distances, under some conditions this method will yield a biased tree.
  • Weighted Neighbor-Joining (Weighbor): The Weighbor criterion consists of two terms, an additivity term (for external branches) and a positivity term (for internal branches), that quantify the implications of joining a pair. Weighbor gives less weight to the longer distances in the distance matrix, and the resulting trees are less sensitive to specific biases than NJ trees and relatively immune to the “long-branch attraction/distraction” drawbacks observed with other methods.
  • Fitch-Margoliash (FM) and Minimum Evolution (ME) Methods: In 1967 Fitch and Margoliash proposed a criterion (the FM method) for fitting trees to distance matrices. This method seeks the least-squares fit of all observed pairwise distances to the distances expected from the tree. The ME method seeks the tree with the minimum sum of branch lengths. Instead of using all the pairwise distances as FM does, it fixes the internal nodes by using the distances to external nodes and then optimizes the internal branch lengths.
FM and ME perform best among the distance-based methods, but they are much slower than NJ, which generally yields a tree very close to theirs.
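
The pair-selection step of Neighbor Joining can be illustrated with a short Python sketch. It uses the Q-matrix formulation commonly given for NJ, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), and joins the pair with the smallest Q. Using this formulation in place of explicitly summing branch lengths is an assumption of the sketch, as are the function name `nj_pick_pair` and the toy distance matrix. Only one iteration is shown, not the complete algorithm.

```python
def nj_pick_pair(names, D):
    """Select the pair of taxa that Neighbor Joining would join first.

    names: list of n >= 3 taxon labels.
    D:     symmetric n x n distance matrix (list of lists).
    Returns (name_i, name_j, length_i, length_j), where the lengths are the
    branch lengths from each joined taxon to the new internal node.
    """
    n = len(names)
    if n < 3:
        raise ValueError("at least three taxa are needed")
    totals = [sum(row) for row in D]  # sum of distances from each taxon
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths allow different rates on the two joined lineages.
    length_i = 0.5 * D[i][j] + (totals[i] - totals[j]) / (2.0 * (n - 2))
    length_j = D[i][j] - length_i
    return names[i], names[j], length_i, length_j


# Made-up additive distance matrix for five taxa.
names = ["a", "b", "c", "d", "e"]
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
joined = nj_pick_pair(names, D)
print(joined)
```

Note that the pair joined first (a and b, at distance 5) is not the pair with the smallest raw distance (d and e, at distance 3). Unlike UPGMA, the criterion also takes into account how far each taxon is from all the others, and the two branch lengths are allowed to differ.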

Character-Based Methods:

Distance-based methods are more rapid and less computationally intensive than character-based methods, but the actual characters are discarded once the distance matrix is derived. On the other hand, character-based methods make use of all known evolutionary information, i.e. the individual substitutions among the sequences, to determine the most likely ancestral relationships.

  • Maximum parsimony (MP): The criterion of the MP method is that the simplest explanation of the data is preferred, because it requires the fewest conjectures. By this criterion, the MP tree is the one that requires the fewest substitutions (evolutionary changes) for all sequences to be derived from a common ancestor.

For each site in the alignment, all possible trees are evaluated and given a score based on the number of evolutionary changes needed to produce the observed sequence changes. The best tree is the one that minimizes the total number of mutations over all sites.

MP is faster than ML, and weighted parsimony schemes can deal with most of the different models used by ML. However, this method yields little information about branch lengths and suffers badly from long-branch attraction: long branches become artificially connected because of the accumulation of non-homologous similarities, even if they are not closely related phylogenetically.
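
The per-site score used by parsimony can be illustrated with Fitch's algorithm, one common way of counting the minimum number of changes that a given tree requires at one site. The text above does not state which counting algorithm is used, so this choice is an assumption, as are the function name `fitch_score`, the nested-tuple tree representation and the toy data.

```python
def fitch_score(tree, states):
    """Minimum number of changes required at one site on a rooted binary tree.

    tree:   nested tuples of leaf names, e.g. (("A", "B"), ("C", "D")).
    states: dict mapping each leaf name to its character at this site.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):  # leaf: its state set is the observed base
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common  # children agree: no change needed here
        changes += 1  # children disagree: one change is required
        return left | right

    visit(tree)
    return changes


# One made-up alignment column for four taxa.
site = {"A": "G", "B": "G", "C": "T", "D": "T"}
score_1 = fitch_score((("A", "B"), ("C", "D")), site)
score_2 = fitch_score((("A", "C"), ("B", "D")), site)
print(score_1, score_2)
```

The first topology explains the column with a single change and the second needs two, so this site favors the first tree. The parsimony score of a tree is the sum of these counts over all sites.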

  • Maximum Likelihood (ML): Like the MP method, the ML method uses each position in an alignment and evaluates all possible trees. It calculates the likelihood of each tree and seeks the one with the maximum likelihood.

For a given tree, the likelihood at each site is the probability that a given evolutionary model generated the observed data. The likelihoods of the individual sites are then multiplied to give the likelihood of the tree.

The ML method is the slowest and most computationally intensive method, though it seems to give the best results and the most informative tree.
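
The likelihood of one fixed tree can be computed as sketched below, using the JC69 model and the standard recursive (pruning) calculation over the tree. This only scores a single tree with given branch lengths; it does not search over trees or optimize branch lengths. The tree representation, the function names, the branch lengths and the sequences are assumptions made for the example. The site likelihoods are combined by summing their logarithms, which is equivalent to multiplying them but avoids numerical underflow.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(same, t):
    """JC69 probability of ending in the same base (same=True) or in one
    specific different base (same=False) after branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial(node, site):
    """Likelihood of the data below `node`, for each possible base at `node`.

    A node is either a leaf name (str) or a tuple of (child, branch_length)
    pairs.
    """
    if isinstance(node, str):
        return {b: 1.0 if b == site[node] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = partial(child, site)
        for b in BASES:
            result[b] *= sum(jc69_prob(b == c, t) * below[c] for c in BASES)
    return result


def site_likelihood(tree, site):
    """Likelihood of one alignment column, with equal base frequencies."""
    root = partial(tree, site)
    return sum(0.25 * root[b] for b in BASES)


def log_likelihood(tree, alignment):
    """Sum of the log-likelihoods of all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(length):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += math.log(site_likelihood(tree, site))
    return total


# Hypothetical tree ((sp1:0.1, sp2:0.1):0.05, sp3:0.2) and made-up sequences.
tree = (((("sp1", 0.1), ("sp2", 0.1)), 0.05), ("sp3", 0.2))
alignment = {"sp1": "ACGT", "sp2": "ACGA", "sp3": "ACTT"}
lnL = log_likelihood(tree, alignment)
print(lnL)
```

An ML program repeats this calculation for many candidate trees and branch lengths, which explains the computational cost mentioned above.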

Statistical evaluation of phylogenetic trees

Once a phylogenetic tree has been constructed, its robustness should be evaluated [HOLMES2002], which means being able to answer the following question: how often would a given branching order be obtained from sequences similar to those used?

The most widespread method for assessing trees is bootstrapping [HOLMES2003]. In this method, sequences similar to those used are constructed by sampling the columns of the multiple alignment with replacement. A new tree is constructed from the resampled alignment, and this process is repeated a certain number of times.

Each branch is assigned the percentage of the constructed trees in which it occurs, and a branch is considered significant if it appears in more than 50% or 70% of them. The remaining branches may be condensed, resulting in polytomies. It is also usual to build the consensus tree, which contains the most frequent branching order.
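
The resampling step of the bootstrap can be sketched in Python as follows. Only the construction of the pseudo-replicate alignments is shown; building a tree from each replicate and counting how often each branch appears would be done with the tree-building method chosen in phase 4. The function names, the fixed random seed and the toy alignment are assumptions made for the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by sampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    # Draw as many column indices as the alignment has columns; the same
    # column may be drawn several times and others not at all.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Return a list of n_replicates resampled alignments."""
    rng = random.Random(seed)  # seeded so that the example is reproducible
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Made-up alignment of three sequences and eight columns.
toy = {"sp1": "ACGTACGT", "sp2": "ACGAACGT", "sp3": "ACTTACGA"}
replicates = bootstrap_replicates(toy, 100)
print(replicates[0])
```

Because whole columns are sampled, the bases of the different sequences at a given site always stay together, so each replicate remains a valid alignment of the same length.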


Figure 6.5: Bootstrap consensus tree.

This figure shows that not all branches are significant; it is therefore necessary to build a condensed tree that retains only the well-supported branches.

Figure 6.6: Bootstrap condensed tree.

This figure shows how the less significant branches have been collapsed into polytomies.
[JUKESCANTOR1969]Jukes T.H., and Cantor C.R. (1969). “Evolution of Protein Molecules”. New York: Academic Press. pp. 21–132.
[KIMURA1980]Kimura M. (1980). “A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences”. Journal of Molecular Evolution 16: 111–120. DOI:10.1007/BF01731581. PMID 7463489.
[TAMURA1992]Tamura K. (1992). “Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases”. Molecular Biology and Evolution 9 (4): 678–687.
[TAMURANEI1993]Tamura K., Nei M. (1993). “Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees”. Molecular Biology and Evolution 10 (3): 512–526.
[HOLMES2002]Holmes, S.P. (2002). “Statistics for Phylogenetic Trees”. Theoretical Population Biology, 63(1):17-32.
[HOLMES2003]Holmes, S.P. (2003). “Bootstrapping Phylogenetic Trees: Theory and Methods”. Statistical Science, 18 (2): 241-255. doi:10.1214/ss/1063994979

Phylogenetic analysis of photosynthetic organisms

This section presents a practical exercise: a statistical and phylogenetic study of the relationships among plants, algae and cyanobacteria, using the nucleotide and amino acid sequences of the large chain of ribulose 1,5-bisphosphate carboxylase/oxygenase (RuBisCO) from different photosynthetic organisms.

The RuBisCO enzyme catalyzes a major step in carbon fixation, the process in which atmospheric carbon dioxide is converted into energy-rich molecules such as glucose. In plants, the RuBisCO large chain is encoded in the chloroplast genome. In this section some features of RuBisCO are analyzed to investigate the relationship between chloroplasts and cyanobacteria, in order to provide evidence for the hypothesis that chloroplasts originated as endosymbionts sharing common ancestors with cyanobacteria.

Photosynthetic organisms to be discussed are:


  • Acaryochloris marina MBIC11017
  • Synechocystis sp. PCC 6803
  • Trichodesmium erythraeum IMS101
  • Synechococcus elongatus PCC 7942
  • Synechococcus sp. WH 8102
  • Prochlorococcus marinus str. MIT 9313
  • Nostoc punctiforme PCC 73102
  • Nostoc sp. PCC 7120
  • Rhizobium etli CFN 42
  • Chlorobaculum tepidum TLS
  • Sinorhizobium meliloti 1021 plasmid pSymB


  • Chlamydomonas reinhardtii
  • Cyanidioschyzon merolae
  • Coccomyxa subellipsoidea
  • Ostreococcus tauri
  • Emiliania huxleyi
  • Chaetosphaeridium globosum
  • Cyanidium caldarium
  • Chlorella vulgaris
  • Gracilaria tenuistipitata var. liui
  • Odontella sinensis
  • Mesostigma viride
  • Nephroselmis olivacea
  • Guillardia theta
  • Porphyra purpurea


  • Arabidopsis thaliana
  • Brachypodium distachyon
  • Phalaenopsis aphrodite
  • Selaginella uncinata
  • Nicotiana sylvestris
  • Lactuca sativa
  • Nicotiana tomentosiformis
  • Adiantum capillus-veneris
  • Cucumis sativus
  • Atropa belladonna Ab5p
  • Psilotum nudum
  • Huperzia lucidula
  • Pinus koraiensis
  • Saccharum hybrid cultivar NCo 310
  • Lotus japonicus
  • Panax ginseng
  • Oryza nivara
  • Physcomitrella patens
  • Calycanthus floridus var. glaucus
  • Zea mays
  • Triticum aestivum
  • Marchantia polymorpha
  • Amborella trichopoda
  • Anthoceros formosae
  • Pinus thunbergii

These organisms cover a wide range of the evolutionary history of photosynthetic organisms, including cyanobacteria, algae, mosses, gymnosperms and angiosperms (monocotyledons and dicotyledons).

Statistical analysis of the RuBisCO

The first part is an analysis of the basic features of the RuBisCO genes in the organisms in question: gene length, GC content and base composition obtained from a multinomial model.
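
These three features can be computed from a nucleotide sequence as sketched below. Under a multinomial model, the maximum-likelihood estimate of each base probability is simply its relative frequency in the sequence, which is what the sketch reports. The function name `basic_features` and the short example sequence are assumptions made for illustration; the sequence is not a real RuBisCO gene.

```python
from collections import Counter


def basic_features(seq):
    """Length, GC content (%) and base composition of a DNA sequence."""
    seq = seq.upper()
    # Only unambiguous bases are counted for the composition.
    counts = Counter(base for base in seq if base in "ACGT")
    n = sum(counts.values())
    if n == 0:
        raise ValueError("sequence contains no A, C, G or T")
    composition = {base: counts[base] / n for base in "ACGT"}
    gc_percent = 100.0 * (counts["G"] + counts["C"]) / n
    return {"length": len(seq), "gc_percent": gc_percent,
            "composition": composition}


# Made-up sequence of ten bases.
features = basic_features("ATGCGCTTAA")
print(features)
```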

Length of the RuBisCO genes


Figure 6.7: Comparison of the length of RuBisCO genes

Figure 6.7 compares the length of the RuBisCO genes in cyanobacteria, algae and plants. In cyanobacteria the most frequent length is 1,431 bp [5], but three organisms fall outside the distribution: Rhizobium, Chlorobaculum and Sinorhizobium, with lengths of 1,254 bp, 1,308 bp and 1,461 bp respectively. In algae the most frequent length is 1,428 bp and all the organisms lie within the distribution; the longest gene among the algae is that of Odontella, at 1,473 bp. Plants have the narrowest distribution, with lengths between 1,428 bp and 1,440 bp; the only atypical value is 1,464 bp, from Phalaenopsis.


Figure 6.8: Length of the RuBisCO genes

Figure 6.8 shows that the most frequent lengths lie between 1,400 bp and 1,450 bp, the second most frequent range is between 1,450 bp and 1,500 bp, and the least frequent range is from 1,250 bp to 1,350 bp. Note that no organism has a length between 1,350 bp and 1,400 bp.

GC content in RuBisCO genes


Figure 6.9: Comparison of GC content in the RuBisCO genes

Figure 6.9 compares the GC content of the RuBisCO genes in cyanobacteria, algae and plants. In cyanobacteria the GC content ranges from 44.02% to 65.31%, all the organisms lie within the distribution, and the most probable value is about 53%. The highest GC content in this group belongs to Rhizobium, which also has the shortest gene. In algae the GC content ranges from 37.63% to 49.30% and the most frequent value is about 40%; all the organisms lie within the distribution, and the highest GC content is that of Coccomyxa, at 49.30%. Plants have the narrowest distribution, with GC content between approximately 42% and 45%; five organisms fall outside it: Marchantia, Physcomitrella, Anthoceros, Adiantum and Selaginella, with GC contents of 37.46%, 38.93%, 40.61%, 47.48% and 53.22% respectively.


Figure 6.10: GC content in the RuBisCO genes

In Figure 6.10 [6] it can be seen that the cyanobacterial genes have the highest GC content; three of them even exceed 60%. Plants have the second highest GC content in their RuBisCO genes, although they also include the organism with the lowest GC content, Marchantia, at 37.46%. Finally, algae have the lowest GC content overall.

Base composition in the RuBisCO genes

Figure 6.11 shows the base composition of the RuBisCO genes of cyanobacteria. There is no regular pattern or common distribution of nucleotides across these genes. In some organisms the G+C content is higher than the A+T content; this is the case for Rhizobium, Synechococcus, Sinorhizobium, Synechococcus elongatus and Chlorobaculum. In four organisms each base accounts for about 25% of the RuBisCO gene, that is, there are almost equal amounts of adenine, cytosine, guanine and thymine; these organisms are Nostoc, Nostoc punctiforme, Acaryochloris and Prochlorococcus. Finally, one organism behaves very differently: Trichodesmium has the highest thymine and adenine content and the lowest cytosine and guanine content, and its base composition is very similar to that of plants (Figure 6.13).


Figure 6.11: Base composition in the RuBisCO genes of cyanobacteria

Figure 6.12 shows the base composition of the RuBisCO genes of algae. Here there is a regular pattern: a high content of adenine and thymine and a low content of guanine and cytosine. In all cases cytosine is the least frequent nucleotide. Three organisms have a base composition different from that of the other algae: Coccomyxa, Ostreococcus and Nephroselmis; the first two have almost the same distribution of nucleotides.


Figure 6.12: Base composition in the RuBisCO genes of algae

Figure 6.13 shows the base composition of the RuBisCO genes of plants, which also follows a regular pattern. Adenine and thymine have the highest frequencies, followed by guanine and finally cytosine, just as in algae, although here the difference between the amounts of thymine and guanine is smaller. Three organisms have RuBisCO genes that behave as in algae: Physcomitrella, Marchantia and Anthoceros. There is also one organism, Selaginella, whose base composition closely resembles that of the cyanobacteria.


Figure 6.13: Base composition in the RuBisCO genes of plants

Figures 6.14, 6.15 and 6.16 show the length, GC content and base composition data for cyanobacteria, algae and plants.


Figure 6.14: Basic analysis data of the RuBisCO genes in cyanobacteria


Figure 6.15: Basic analysis data of the RuBisCO genes in algae


Figure 6.16: Basic analysis data of the RuBisCO genes in plants

K-mers Frequency in the RuBisCO genes

Figures 6.17, 6.18 and 6.19 show the dinucleotide frequencies in the RuBisCO genes of the organisms discussed. For algae and plants the least frequent dinucleotides are CC and CG, and the most frequent are AA, TG and TT. For cyanobacteria the least frequent dinucleotide is TA. There is no evidence of any particular dinucleotide being the most frequent in cyanobacteria, but for Rhizobium and Sinorhizobium the most frequent are CG and GC, which are also the highest dinucleotide frequencies in the cyanobacteria group.

According to Figure 6.18, the highest dinucleotide frequencies in algae are AA in Cyanidium, TT in Gracilaria, and AA in Guillardia and Porphyra.

Figure 6.19 shows that Selaginella behaves differently from the other plants: where most of them have their lowest frequencies (dinucleotides CC and CG), Selaginella has its highest, and where the other plants reach their highest frequencies (dinucleotides TG and TT), Selaginella has two of its lowest.
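
K-mer frequencies of this kind can be computed with a short Python function such as the one below, where k = 2 gives dinucleotides and other values of k work in the same way. The text does not say whether overlapping or non-overlapping windows were used; the sketch assumes overlapping windows. The function name `kmer_frequencies` and the example sequence are also assumptions.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over A, C, G, T in a sequence.

    Overlapping windows are used (an assumption of this sketch); windows
    containing any other character are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    valid = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in valid)
    if total == 0:
        raise ValueError("sequence is too short or contains no valid k-mers")
    # Report all 4**k k-mers, including those that never occur.
    return {kmer: counts[kmer] / total for kmer in valid}


# Made-up sequence: the overlapping dinucleotides are AT, TA, AT, TA, AT.
freqs = kmer_frequencies("ATATAT", 2)
print(freqs["AT"], freqs["TA"], freqs["CG"])
```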


Figure 6.17: Dinucleotide frequency in the RuBisCO genes of cyanobacteria


Figure 6.18: Dinucleotide frequency in the RuBisCO genes of algae


Figure 6.19: Dinucleotide frequency in the RuBisCO genes of plants

Figures 6.20, 6.21 and 6.22 show the trinucleotide frequencies in the RuBisCO genes of the organisms discussed. For cyanobacteria there is no common pattern in the trinucleotide frequencies, although two organisms, Rhizobium and Sinorhizobium, are very similar to each other. For algae there are two regions of low frequency: the first comprises the trinucleotides CCC, CCG, CCT, CGA, CGC and CGG, and the second the trinucleotides GGC and GGG. The highest frequencies in algae are those of the trinucleotide AAA, except in Odontella. The most frequent trinucleotides in plants are AAA, ATG, ATT, TTG and TTT, and the least frequent are CCC, CGC, CGG and GGC.

Figure 6.22 shows that the trinucleotide frequencies of Physcomitrella and Marchantia are very similar, and that Selaginella seems to behave differently from the other plants.


Figure 6.20: Trinucleotide frequency in the RuBisCO genes of cyanobacteria


Figure 6.21: Trinucleotide frequency in the RuBisCO genes of algae


Figure 6.22: Trinucleotide frequency in the RuBisCO genes of plants

Phylogenetic analysis of the RuBisCO

As described in the previous section, the phylogenetic analysis of RuBisCO is carried out in five phases.

Phase 1: Selecting the biological sequences to be analyzed.

The biological sequences of the RuBisCO gene of the organisms discussed were taken from the NCBI DataBank. For prokaryotes the genes were found in their genomic DNA, whereas for plants and some algae they were found in their chloroplast DNA.

The NCBI DataBank provides this gene as a separate record for several organisms, but for some organisms it still has to be located within the complete genome.

The phylogenetic analysis of RuBisCO could be performed with either nucleotides or amino acids, but given the evolutionary distances among the groups of organisms it is better to use nucleotides.

Phase 2: Building a multiple alignment of the selected biological sequences.

MEGA was used to create the multiple sequence alignment of the RuBisCO genes of all the selected organisms. This tool can build the multiple alignment with two methods, ClustalW and MUSCLE; both were used, and the best-fit substitution models were evaluated for each in order to obtain a better phylogenetic tree.

There was no remarkable difference in execution time between ClustalW and MUSCLE; each took about half an hour.

Phase 3: Selecting the substitution models.

For both the ClustalW and the MUSCLE alignments (Figure 6.23 and Figure 6.24 respectively), the best model is Tamura 3-parameter (T92) [7] with a discrete Gamma distribution (+G), and the second best model is Tamura 3-parameter (T92) with a discrete Gamma distribution (+G) plus the assumption that a certain fraction of sites are evolutionarily invariable (+I).

For the ClustalW alignment the third and fourth best models are the Tamura-Nei models (TN93+G and TN93+G+I). For the MUSCLE alignment, on the other hand, the General Time Reversible models (GTR+G and GTR+G+I) occupy those positions.


Figure 6.23: Best fit substitution models according to ClustalW


Figure 6.24: Best fit substitution models according to MUSCLE

Models with the lowest BIC scores (Bayesian Information Criterion) are considered to describe the substitution pattern the best. For each model, AICc value (Akaike Information Criterion, corrected), Maximum Likelihood value (lnL), and the number of parameters (including branch lengths) are also presented [NEI2000]. Non-uniformity of evolutionary rates among sites may be modeled by using a discrete Gamma distribution (+G) with 5 rate categories and by assuming that a certain fraction of sites are evolutionarily invariable (+I). Whenever applicable, estimates of gamma shape parameter and/or the estimated fraction of invariant sites are shown. Assumed or estimated values of transition/transversion bias (R) are shown for each model, as well. They are followed by nucleotide frequencies (f) and rates of base substitutions (r) for each nucleotide pair. Relative values of instantaneous r should be considered when evaluating them. For simplicity, sum of r values is made equal to 1 for each model. For estimating ML values, a tree topology was automatically computed. The analysis involved 50 nucleotide sequences. Codon positions included were 1st + 2nd + 3rd + Noncoding. All positions containing gaps and missing data were eliminated. There were a total of 1221 positions in the final dataset. Evolutionary analyses were conducted in MEGA5 [TAMURA2011].
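
The two criteria mentioned above can be written out with their usual definitions, BIC = -2 lnL + k ln(n) and AICc = -2 lnL + 2k + 2k(k + 1)/(n - k - 1), where lnL is the maximized log-likelihood, k the number of parameters and n the sample size. The sketch below applies them to rank models. Taking n as the number of alignment positions is an assumption of this sketch, and the model names, log-likelihoods and parameter counts are made-up values for illustration, not results from Figures 6.23 or 6.24.

```python
import math


def bic(lnL, k, n):
    """Bayesian Information Criterion (lower is better)."""
    return -2.0 * lnL + k * math.log(n)


def aicc(lnL, k, n):
    """Akaike Information Criterion with small-sample correction."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return -2.0 * lnL + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


# Hypothetical candidates: name -> (lnL, number of parameters).
candidates = {"model_A": (-1000.0, 10), "model_B": (-995.0, 20)}
n_sites = 1221  # assumed sample size: positions in the final dataset

scores = {
    name: {"BIC": bic(lnL, k, n_sites), "AICc": aicc(lnL, k, n_sites)}
    for name, (lnL, k) in candidates.items()
}
best = min(scores, key=lambda name: scores[name]["BIC"])
print(best, scores[best])
```

In this made-up example the second model has a slightly higher likelihood, but BIC prefers the first because the ten extra parameters are penalized more heavily than the gain in fit.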

Phase 4: Building the phylogenetic tree.

To build the phylogenetic tree of the RuBisCO genes of prokaryotes, algae and plants, the Tamura 3-parameter substitution model with a discrete Gamma distribution (Gamma = 0.52) was used, following the best fit suggested for the ClustalW multiple sequence alignment. The phylogenetic tree was built with the Neighbor-Joining algorithm, including both transitions and transversions in the substitution model. The tree was constructed using MEGA.


Figure 6.25: Consensus tree of the RuBisCO genes

Figure 6.25 shows the consensus tree of the RuBisCO genes of all the organisms discussed in this section. According to the tree there is evidence of a common ancestor of cyanobacteria, algae and plants. Sinorhizobium meliloti is an alphaproteobacterium and shares a common ancestor with the cyanobacteria; Trichodesmium erythraeum (a cyanobacterium) in turn shares a common ancestor with the algae; and finally Chaetosphaeridium globosum, an alga, shares a common ancestor with the plants. This alga is, moreover, also classified as a green plant.

Phylogenetic tree

Figure 6.26: Consensus tree of the RuBisCO genes (Prokaryotes)

Figure 6.26 shows the sub-tree of prokaryotes, which has some interesting features. Rhizobium etli CFN and Sinorhizobium meliloti are both alphaproteobacteria, but they appear on different branches. Moreover, Rhizobium etli and Chlorobaculum tepidum belong to different groups (Chlorobaculum tepidum is a green sulfur bacterium), yet they not only share the same common ancestor, but the percentage of occurrences on their branch is 100%. These two organisms appear on the consensus tree as an outgroup.

Phylogenetic tree

Figure 6.27: Consensus tree of the RuBisCO genes (Algae)

Figure 6.27 shows the sub-tree of algae. It is easy to identify in this tree the red algae group, the green algae group, the green plants group, and a group of three algae from different lineages: Odontella sinensis (a diatom), Guillardia theta (a cryptomonad) and Emiliania huxleyi (a haptophyte).

Phylogenetic tree

Figure 6.28: Consensus tree of the RuBisCO genes (Plants)

Figure 6.28 shows the sub-tree of plants. It is interesting to see that some pairs of organisms, such as Pinus thunbergii and Pinus koraiensis, Zea mays and Saccharum hybrid cultivar, and Brachypodium distachyon and Triticum aestivum, have a percentage of occurrences over 95% (100% for the first four organisms). The figure also shows three plants that seem farther from the others: Anthoceros formosae, Marchantia polymorpha and Physcomitrella patens.

The consensus tree provides enough information to suggest that cyanobacteria, algae and plants have a common ancestor. However, not all of its branches show a percentage of occurrences higher than 50%, so it is necessary to build a condensed tree that provides stronger statistical evidence that these organisms share a common ancestor.

Phase 5: Statistical evaluation of the phylogenetic trees.

The consensus tree was tested with the bootstrap method, using 500 replications. The condensed tree was then built with a cut-off value of 50%.
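The two steps of this phase can be sketched in a few lines. The first function draws one bootstrap replicate by resampling alignment columns with replacement; the second collapses every internal branch whose support falls below the cut-off into a polytomy, which is the idea behind a condensed tree. This is only an illustrative sketch: the tree representation (a leaf is a string, an internal node is a `(support, children)` tuple), the function names and the toy data are assumptions, and MEGA's own implementation may differ.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # The same sampled columns are applied to every sequence.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in columns)
            for name in names}

def condense(node, cutoff=50.0):
    """Collapse internal branches with support below the cut-off."""
    if isinstance(node, str):          # leaf: nothing to collapse
        return node
    support, children = node
    new_children = []
    for child in children:
        child = condense(child, cutoff)
        if not isinstance(child, str) and child[0] < cutoff:
            # Weakly supported clade: attach its children to the parent.
            new_children.extend(child[1])
        else:
            new_children.append(child)
    return (support, new_children)

# Toy alignment and tree, made up for illustration.
toy_alignment = {"taxon_a": "ACGTAC", "taxon_b": "ACGTTC", "taxon_c": "ACCTTC"}
rng = random.Random(0)
replicates = [bootstrap_replicate(toy_alignment, rng) for _ in range(500)]

# The root support (100.0) is only a placeholder in this representation.
toy_tree = (100.0, [(40.0, ["A", "B"]), (90.0, ["C", "D"])])
condensed = condense(toy_tree, cutoff=50.0)
```

In a full analysis, a tree would be inferred from each of the 500 replicates, and the support of a branch would be the percentage of replicate trees that contain it. In the toy tree, the clade of A and B has 40% support and is collapsed, while the clade of C and D (90%) is kept.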

Phylogenetic tree

Figure 6.29: Condensed tree of the RuBisCO genes

Figure 6.29 shows the condensed tree. For prokaryotes there are a few changes: Sinorhizobium meliloti, Synechococcus sp. WH 8102 and Synechococcus elongatus PCC 7942 were grouped on the same branch, as were Acaryochloris marina and Synechocystis sp. PCC 6803, and Nostoc punctiforme PCC 73102 and Nostoc sp. PCC 7120. For algae there is an important change: the red algae were grouped in the same clade. For plants there are no important changes.

Phylogenetic tree

Figure 6.30: Condensed circular tree of the RuBisCO genes

Figure 6.30 gives a different view of the condensed tree, in which some branches can be seen more clearly.

Results of the analysis

The endosymbiotic theory [MARGULIS1981] concerns the mitochondria, plastids (e.g. chloroplasts), and possibly other organelles of eukaryotic cells. According to this theory, certain organelles originated as free-living bacteria that were taken inside another cell as endosymbionts. The theory postulates that chloroplasts evolved from endosymbiotic cyanobacteria. Although this statistical and phylogenetic analysis does not provide enough evidence to confirm that chloroplasts evolved from endosymbiotic cyanobacteria, it does support the conclusion that the RuBisCO gene present in prokaryotes shares common ancestors with the RuBisCO gene present in eukaryotic chloroplasts.

[NEI2000]Nei M. and Kumar S. (2000). “Molecular Evolution and Phylogenetics”. Oxford University Press, New York.
[TAMURA2011]Tamura K., Peterson D., Peterson N., Stecher G., Nei M., and Kumar S. (2011). “MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods”. Molecular Biology and Evolution (In Press).
[MARGULIS1981]Margulis L. (1981). “Symbiosis in Cell Evolution”. WH Freeman, New York.


[1]Basic Local Alignment Search Tool (BLAST) is the tool most frequently used for calculating sequence similarity. BLAST comes in variations for use with different query sequences against different databases.
[2]Molecular Evolutionary Genetics Analysis (MEGA) is an integrated tool for conducting automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, inferring ancestral sequences, and testing evolutionary hypotheses.
[3]MUltiple Sequence Comparison by Log-Expectation (MUSCLE) is a program for creating multiple alignments of amino acid or nucleotide sequences.
[4]Clustal W is a general purpose multiple alignment program for DNA or proteins.
[5]bp = base pairs or nucleotides
[6]Numbers representing organisms correspond to the numbers of the organisms in Figure 6.14 (cyanobacteria), Figure 6.15 (algae) and Figure 6.16 (plants).
[7]Abbreviations: GTR: General Time Reversible; HKY: Hasegawa-Kishino-Yano; TN93: Tamura-Nei; T92: Tamura 3-parameter; K2: Kimura 2-parameter; JC: Jukes-Cantor.