# Biological Sequence Alignment¶

In the previous chapter the ab initio methods were studied to identify genes in the sequences of nucleotides that make up the genomes of living organisms. The next step in the annotation of a genome is to assign potential functions to different genes, i.e., prediction of functionality.

This task can be assisted by mathematical-computational methods that use available information on gene function in other genomes different from the studied. These methods assume that by knowing the function of a gene in an organism can be inferred that similar genes have a similar function in other organisms.

Thus, the task of assigning potential function to genes is reduced to measure the similarity between genes. This task is solved by comparing the corresponding sequences of nucleotides or amino acids carrying a possibly alignment between similar sequences.

## Introduction¶

The alignment of biological sequences is probably the most important and most accomplished in the field of bioinformatics. Their ultimate goal is to determine the similarity between different sequences. The sequence alignment is therefore a great number of applications:

• Gene finding: Having identified all ORFs of a genome can be used their lengths to estimate the probability of actually constituting genes, these are ab initio methods mentioned before. On the other hand, if there is a gene from another organism with a good alignment with the ORF found, there is strong evidence that this ORF is a gene.
• Function prediction: The alignment of sequences to determine if two genes are similar. That way if the first function of these genes is known we can assign the same function as the second.
• Genome Sequence Assembly: the de-novo assembly is a very important task in bioinformatics is based on the alignments between short DNA sequences obtained by the new-generation sequencers.

One of the main applications of sequence alignment is the identification of homologous genes. Living organisms share a large number of genes descended from common ancestors and have been maintained in different organisms due to its functionality but accumulate differences that have diverged from each other. These differences may be due to mutations that change a symbol (nucleotide or amino acid) for another or insertions / deletions, indels, which insert or delete a symbol in the corresponding sequence.

Two genes are homologous if they share a common ancestor. Therefore, the homology is a dichotomous characteristic, i.e., given two genes are either homologous genes or not. However, given two sequences corresponding to two genes, can be said that there are different levels of similarity based on an alignment between them. The key task is to determine whether a good alignment between two sequences is significant enough to consider that both genes are homologous. This task, in the same way as section 4.2.2, is done through a hypothesis testing and the corresponding p-values are used to make a decision.

There are two different forms of homology. When the origin of two homologous genes is due to a process of gene duplication within the same species these genes are called paralogs genes, whereas when the origin is due to a speciation process resulting in homologous genes in these different species are called orthologous genes.

A preliminary analysis to study the biological similarity between two sequences s and t is to produce a dotplot. A dotplot is a graphical representation that places the corresponding sequences in the horizontal and vertical axis. Each point (i,j) of the graph compares the symbols s[i] and t[j]. If both matches, the corresponding cell is drawn in black, otherwise it remains white. In this way, regions that have a high similarity in the dotplot appear as line segments that can be on the main diagonal or outside it.

Figure 5.1 shows an example of similarity between the protein RuBisCO of the cyanobacterium Prochlorococcus Marinus MIT 9313 and the unicellular green alga Chlamydomonas reinhardtii.

Figure 5.1: Similarity between RuBisCO proteins

## Global sequence alignment¶

The overall similarity between two biological sequences is studied usually doing an alignment between them.

### Global alignment¶

Given two biological sequences s and t, and a special symbol “-“ to represent gaps. A global alignment of s and t is defined as the insertion of gaps at the beginning, end or inside of sequences s and t such that the resulting strings s’ and t’ are the same length and can establish a correspondence between the symbols s’[i] and t’[i]. The only restriction consists in not allowing the corresponding symbols s’[i] and t’[i] to be the gap symbol since in this case both symbols can be removed from both sequences.

Frequently, an alignment between two biological sequences is represented as a matrix of three rows. The first row represents the first sequence while the second sequence is described in the third row. The second row represents the matching symbols between the first and second sequence using the pipe symbol “|”. The mismatches and gaps between sequences are represented by the blank symbol. For example, the following matrix shows the alignment between the first 20 amino acids of the RuBisCO protein of Prochlorococcus Marinus MIT 9313 and Chlamydomonas reinhardtii:

```"M"  "-"  "-"  "-"  "S"  "K"  "-"  "K"  "Y"  "D"   "A"   "G"   "V"
"|"  " "  " "  " "  " "  " "  " "  "|"  " "  " "   "|"   "|"   " "
"M"  "V"  "P"  "Q"  "T"  "E"  "T"  "K"  "A"  "G"   "A"   "G"   "F"

"K"  "E"  "-"   "Y"   "R"   "D"   "T"   "Y"   "W"   "T"   "P"   "D"
"|"  " "  " "   " "   " "   "|"   " "   "|"   " "   " "   " "   " "
"K"  "A"  "G"   "V"   "K"   "D"   "-"   "Y"   "-"   "-"   "-"   "-"
```

To determine the similarity between two biological sequences must be sought the optimal global alignment between them. To perform this task is necessary to assign a score to each possible alignment. This is done using substitution matrices.

### Substitution matrices¶

Given an alphabet S of length n which contains the symbols of biological sequences studied (typically S = SDNA, or S = SAA). A substitution or scoring matrix, M, associated with S is defined as a square matrix of order (n+1)x(n+1) where the first n rows and columns correspond to the symbols of S while the last row and column corresponding to the gap symbol “-”. The cell (i,j), for i=1,...n, y j=1,...,n, represents the value associated with the correspondence between siand sjsymbols in a given alignment between two sequences. Denote this value by M(si,sj). The last row and column represent the value associated with the correspondence between a symbol of S and a gap, M(-,sj), in a given alignment between two sequences.

Thus, substitution matrices tend to verify:

• M(si,sj) = M(sj,si)
• M(si,si) > 0.
• M(si,sj) < 0, for i different from j.
• M(si, -) < 0

Substitution matrices seek to capture the biochemical similarity between the different monomers constituting biological sequences to better reflect evolutionary processes.

In the case of DNA sequences is known that nucleotides are divided into purines (a, g) and pyrimidines (c, t). The nucleotide substitutions of the same type (a <-> g or c <-> t) are called transitions. While nucleotide substitutions of different types (a <-> c, a <-> t, g <-> c, or g <-> t) are called transversions. It is observed that due to the biochemical properties, transitions are more frequent than transversions. Often, this is captured in the corresponding substitution matrices assigning higher penalties to transversions than transitions. Substitution matrices for the DNA sequences are thus of order 4x4, such as the following example:

```     a    g    c    t    -
a  1.0 -1.0 -1.5 -1.5 -2.0
g -1.0  1.0 -1.5 -1.5 -2.0
c -1.5 -1.5  1.0 -1.0 -2.0
t -1.5 -1.5 -1.0  1.0 -2.0
- -2.0 -2.0 -2.0 -2.0 -2.0```

In a highly marked way, in amino acids, not all possible substitutions are observed with the same frequency due to the different biochemical properties such as size, porosity and hydrophobicity that make some of them interchangeable between them more than others. Substitution matrices for polypeptide sequences tend to lower the penalties for such substitutions between amino acids in an alignment. The two families of substitution matrices for amino acids most commonly used are the PAM and BLOSUM matrices.

PAM (Point Accepted Mutations) matrices are obtained from a base matrix PAM1 estimated from known alignments between DNA sequences that differ only by 1%. To compare more divergent sequences are used extrapolations of this matrix which are obtained as powers of PAM1. For example, PAM250 is obtained by multiplying PAM1 itself 250 times. It is noteworthy that the extrapolation is not linear, i.e., PAM250 is not used for sequences that differ by 250%.

However, BLOSUM (Blocks Substitution Matrix) matrices are estimated from known alignments between sequences that differ by a fixed percentage. For example, the BLOSUM62 matrix is constructed using sequences for which are known to differ by 62%.

The following is an example of PAM and BLOSUM substitution matrices.

```PAM250

A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  2 -2  0  0 -2  0  0  1 -1 -1 -2 -1 -1 -3  1  1  1 -6 -3  0  0  0  0 -8
R -2  6  0 -1 -4  1 -1 -3  2 -2 -3  3  0 -4  0  0 -1  2 -4 -2 -1  0 -1 -8
N  0  0  2  2 -4  1  1  0  2 -2 -3  1 -2 -3  0  1  0 -4 -2 -2  2  1  0 -8
D  0 -1  2  4 -5  2  3  1  1 -2 -4  0 -3 -6 -1  0  0 -7 -4 -2  3  3 -1 -8
C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3  0 -2 -8  0 -2 -4 -5 -3 -8
Q  0  1  1  2 -5  4  2 -1  3 -2 -2  1 -1 -5  0 -1 -1 -5 -4 -2  1  3 -1 -8
E  0 -1  1  3 -5  2  4  0  1 -2 -3  0 -2 -5 -1  0  0 -7 -4 -2  3  3 -1 -8
G  1 -3  0  1 -3 -1  0  5 -2 -3 -4 -2 -3 -5  0  1  0 -7 -5 -1  0  0 -1 -8
H -1  2  2  1 -3  3  1 -2  6 -2 -2  0 -2 -2  0 -1 -1 -3  0 -2  1  2 -1 -8
I -1 -2 -2 -2 -2 -2 -2 -3 -2  5  2 -2  2  1 -2 -1  0 -5 -1  4 -2 -2 -1 -8
L -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6 -3  4  2 -3 -3 -2 -2 -1  2 -3 -3 -1 -8
K -1  3  1  0 -5  1  0 -2  0 -2 -3  5  0 -5 -1  0  0 -3 -4 -2  1  0 -1 -8
M -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6  0 -2 -2 -1 -4 -2  2 -2 -2 -1 -8
F -3 -4 -3 -6 -4 -5 -5 -5 -2  1  2 -5  0  9 -5 -3 -3  0  7 -1 -4 -5 -2 -8
P  1  0  0 -1 -3  0 -1  0  0 -2 -3 -1 -2 -5  6  1  0 -6 -5 -1 -1  0 -1 -8
S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  2  1 -2 -3 -1  0  0  0 -8
T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -3  0  1  3 -5 -3  0  0 -1  0 -8
W -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17  0 -6 -5 -6 -4 -8
Y -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10 -2 -3 -4 -2 -8
V  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4 -2 -2 -1 -8
B  0 -1  2  3 -4  1  3  0  1 -2 -3  1 -2 -4 -1  0  0 -5 -3 -2  3  2 -1 -8
Z  0  0  1  3 -5  3  3  0  2 -2 -3  0 -2 -5  0  0 -1 -6 -4 -2  2  3 -1 -8
X  0 -1  0 -1 -3 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1  0  0 -4 -2 -1 -1 -1 -1 -8
* -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8  1```
```BLOSUM62

A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  J  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1 -1 -1 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  4 -3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4 -3  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0 -2  4 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1 -3  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -4 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0 -3  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3  3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4  3 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0 -3  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3  2 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3  0 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0 -2  0 -1 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1 -1 -1 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -2 -2 -1 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -1 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3  2 -2 -1 -4
B -2 -1  4  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4 -3  0 -1 -4
J -1 -2 -3 -3 -1 -2 -3 -4 -3  3  3 -3  2  0 -3 -2 -1 -2 -1  2 -3  3 -3 -1 -4
Z -1  0  0  1 -3  4  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -2 -2 -2  0 -3  4 -1 -4
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1```

Given two sequences s and t, an alignment of them A of length m and a substitution matrix M, the alignment score can be assigned by adding the values represented in M for each position of the alignment of A:

Score(A) = M(A(1,1),A(3,1)) + M(A(1,2),A(3,2)) + ... + M(A(1,m),A(3,m))

Since it is possible to measure the goodness of an alignment through the points obtained using a substitution matrix the optimal global alignment between two sequences can be defined as the one who obtains the highest possible score.

Thus, the computational problem to be solved is, given two sequences s and t, and a substitution matrix M; find A* the optimal global alignment between s and t. The brute force algorithm consists of enumerating all possible alignments between s and t and then take the highest score, this is computationally intractable due to the number of possible alignments between two given sequences. However, in 1970 Needleman and Wunsch introduced an algorithm based on dynamic programming to find efficiently the optimal alignment between two given sequences.

### Needleman-Wunsch Algorithm¶

The Needleman-Wunsch algorithm is a sample of dynamic programming, introduced in the previous chapter, which is based on the division of the problem addressed in simpler subproblems so that the complete solution can be obtained by combining the partial solutions corresponding subproblems.

The algorithms that use this technique follow a similar structure consisting of the establishment of recursive relationships, storing partial results in a table and a traceback to finally build the solution.

The following describes the general structure of the algorithm:

• Recursive relationships: The main idea behind the Needleman-Wunsch algorithm is based on the fact that to calculate the optimal alignment score between the first i and j symbols of two sequences is sufficient to know the optimal alignment score up to the previous positions. In this way the problem of aligning two sequences s and t is reduced to solve the subproblems associated with aligning the indices of s and t. For this, the following function is introduced:

Score (i+1,j+1) = the optimal alignment score for indices s[1:i] y t[1:j].

Knowing previously the partial alignment of the sequences s and t up to the symbols i and j, the alignment up to the following symbols could be known, i.e., symbols i+1 and j+1, for which there are three different possibilities:

• Aligning the symbols s[i] and t[j]. This involves moving to the following symbols of s and t, and add the corresponding score of aligning symbols s[i] and t[j] according to the substitution matrix M:

Score(i+1,j+1) = Score(i,j) + M(s[i],t[j])

• Insert a gap in the sequence t. This means not moving to the next symbol of t, but to the next symbol of s and add the penalty of aligning the symbol s[i] with the gap symbol according to the substitution matrix M:

Score(i+1,j+1) = Score(i,j+1) + M(s[i],-)

• Insert a gap in the sequence s. This means not moving to the next symbol of s, but to the next symbol of t and add the penalty of aligning the symbol t[j] with the gap symbol according to the substitution matrix M:

Score(i+1,j+1) = Score(i+1,j) + M(-,t[j])

Therefore, to obtain the maximum score to the positions i and j is sufficient to take the maximum of three possible decisions to be taken:

Score(i+1,j+1) = max {Score(i,j) + M(s[i],t[j]), Score(i,j+1) + M(s[i],-), Score(i+1,j) + M(-,t[j])}

In the above calculation one of three decisions must be taken: (1) align the two corresponding symbols, (2) adding a gap in the second sequence or (3) add a gap in the first sequence. This decision should be stored:

decision(i+1,j+1) = arg max {Score(i,j) + M(s[i],t[j]), Score(i,j+1) + M(s[i],-), Score(i+1,j) + M(-,t[j])}

As a base cases the scores corresponding to align index s[1:i] with i gap symbols and index t[1:j] with j gap symbols can be set as follow:

Score(1,1) = 0

Score(i+1,1) = M(s[1],-) + ... + M(s[i],-) for i=1,...,n

Score(1,j+1) = M(-,t[1]) + ... + M(-,t[j]) for j=1,...,m

• Tabular computations: To calculate and store progressively the scores, Score(i, j) a table of dimensions (n+1) x (m+1), is used where n is the length of the first sequence to align, s, and m is the length of the second sequence to align, t. Initially the first row and first column are filled with multiples of the penalty for adding a gap:

Score(1,1) = 0

Score(i+1,1) = M(s[1],-) + ... + M(s[i],-) for i=1,...,n

Score(1,j+1) = M(-,t[1]) + ... + M(-,t[j]) for j=1,...,m

Additionally, in another table called decisions, of the same dimension, the decisions made in each cell of Score are stored. The alignment of two symbols is represented by the number 1, the insertion of a gap in the second is represented by the number 2 and finally the insertion of a gap in the first sequence is represented by the number 3. Therefore, the first row and first column of decisions are populated with the values:

decisions(1,1) = 0

decisions(i,1) = 3 for i=2,...,n+1

decisions(1,j) = 2 for j=2,...,m+1

Progressively, for i = 1,...,n y j=1,...,m the remaining cells of the table Score are filled according to recursive relationship:

Score(i+1,j+1) = max {Score(i,j) + M(s[i],t[j]), Score(i,j+1) + M(s[i],-), Score(i+1,j) + M(-,t[j])}

Likewise, the decisions matrix stores the decision made in each cell of Score:

decision(i+1,j+1) = arg max {Score(i,j) + M(s[i],t[j]), Score(i,j+1) + M(s[i],-), Score(i+1,j) + M(-,t[j])}

• Traceback: Once completed the Score and decisions tables, the optimal alignment score between s and t corresponds with the value Score(n+1,m+1), the value stored in the last cell. To reconstruct the decisions taken in the optimal alignment the decisions table must be covered backward as follows:

1. Two pointers are initialized k = n+1 y l = m+1, and the length of alignment alingment.length = 1

2. Sets taken.decisions [alingment.length] = decisions[k,l]

3. Depending on the value of taken.decisions the pointers are moved upward, left or diagonally across the table.

• If taken.decisions [alingment.length] is equal to 1 then a symbol of each sequence has been aligned and therefore the pointers are moved diagonally, i.e., k = k - 1 and l = l - 1.
• If taken.decisions [alingment.length] is equal to 2 then a gap has been added in the second sequence and therefore the pointers are moved one position to the left, i.e., k = k and l = l - 1.
• If taken.decisions [alingment.length] is equal to 3 then a gap has been added in the first sequence and therefore the pointers are moved up one position, i.e., k = k - 1, l = l.
4. Updates the length of the alignment, alignment.length = alignment.length + 1.

5. If cell 1,1 has been reached, whose value is 0, then the algorithm is complete. Otherwise, the current cell will be inspected again from step 2.

This algorithm has been implemented in GetGlobalAlignmentData function. Additionally, GetDecisionTraceback function performs the traceback on Needleman-Wunsch algorithm, taking as input the matrix of decisions taken. Finally, GetAlignmentMatrix function constructs the alignment between two given sequences once executed the Needleman-Wunsch algorithm:

### Statistical significance of alignments¶

Once the optimal global alignment score between the sequences of two genes has been determined must decide if this value is because both genes are homologous or pure randomness. As in the previous chapter, the methodology used to determine whether the optimal alignment between two sequences is statistically significant is to make a hypothesis test.

The hypothesis test is designed in this case is as follows:

H0: The alignment is random

H1: The alignment is significant and both genes are homologous

The statistic used in this test is the optimal alignment score between the two genes.

The next step is to calculate the associated p-value. The p-value is defined as the probability of obtaining the value of statistical due to pure randomness assuming the null hypothesis is true. To do this, the alignment score of the first gene is calculated with random sequences obtained following the same model of the second gene (the Markov model or multinomial model). The corresponding p-value is estimated as the relative frequency of random alignment scores that exceed or equal the optimal alignment score between two given genes.

Finally, the significance level, alpha, is set. If the estimated p-value is much lower than the significance level, the null hypothesis is rejected and therefore can be said that there is evidence that both genes are homologous. Otherwise, the alignment is not significant and there is no evidence of homology.

The first step in determining the statistical significance of an alignment is to generate amino acid sequences following the same Markov model (it would also be feasible to use multinomial models) of one of the two sequences. To do this the GetAminoAcidMarkovModel function is used, which receives as input an amino acid sequence and returns the corresponding Markov model.

Then, to generate random sequences the GetRandomSequence function is implemented, which receives as input the elements of a Markov model of a sequence, i.e., initial.probabilities and transition.probabilities; it also receives the length of the random sequence to generate sequence.length and the symbols used in that sequence, sequence.symbols.

Finally, the p-value associated with an alignment is estimated according to the algorithm implemented in GetAlignmentSignificance function. Given two sequences to estimate the corresponding p-value the probability of obtaining a score (estimate value) better than that for the optimal alignment between them must be calculated by generating random alignments. To do this the optimal alignment score between sequence.1 and sequence.2, is calculated, i.e., alignment.score.

As an example, results from the Rubisco protein alignment between the cyanobacterium Prochlorococcus Marinus MIT 9313 and the alga Chlamydomonas reinhardtii, available in UniProt with accession numbers Q7V6F8 [1] and P00877 [2] respectively.

Figure 5.2 shows a histogram that relates the score for alignments with random sequences and their frequencies, but none of them reaches the optimal alignment score, which in this case is 1794, can therefore be concluded that this alignment is significant and both proteins are homologous.

Figure 5.2: Statistical significance of alignments

## Local sequence alignment¶

Biological sequences such as proteins are composed of different parts called domains. These subsequences have an associated function. For example, the structure associated with the zinc finger domain is involved in protein-DNA interaction. In this context, a very common situation is to find local similarities between two biological sequences s and t, i.e., determine two subsequences s’ and t’ that could be aligned. In this way can be found common conserved domains and assigned as possible functions those associated with the corresponding domains aligned.

### Local alignment¶

The local alignment between two sequences s and t consists firstly in remove from each sequence a prefix and suffix for two subsequences s’ and t’. Then a global alignment is performed between these sequences. Just as in the case of global alignment scoring matrices are used. In the case of proteins, once again the families of substitution matrices most used are PAM and BLOSUM matrices.

The task of finding the optimal local alignment between two sequences s and t consists of determining the indices (i,j) and (k,l) such that the global optimum alignment between the subsequences s[i:j] and t[k:l] obtains the highest score among all possible choices of indices.

### Smith-Waterman Algorithm¶

Initially the search for the optimal local alignment between two sequences s and t is computationally more expensive that searching for the optimal global alignment since the former requires calculating the global optimum algorithm among all subsequences of s and t to select the one with the highest score. However, an adaptation of the Needleman-Wunsch Algorihtm to the local case makes both tasks have the same computational cost.

This algorithm is called the Smith-Waterman algorithm and follows the same scheme based on dynamic programming than the Needleman-Wunsch algorithm.

Following describes the general structure of the algorithm:

• Recursive relationships: The main idea behind the Smith-Waterman algorithm is to add a fourth option when extending a partial alignment to prevent the alignment score from being negative. The fourth option is 0 and therefore corresponds to removing a prefix of both sequences.

Score(i+1,j+1) = max {Score(i+1,j) + M(-,t[j]), Score(i,j+1) + M(s[i],-), Score(i,j) + M(s[i],t[j]), 0}

In the above calculation should be decided on: (1) adding a gap in the first sequence, (2) adding a gap in the second sequence or (3) align the two corresponding symbols and (4) delete the corresponding prefix. As in algorithm of Needleman-Wunsch this decision should be stored:

decision(i+1,j+1) = arg max {Score(i+1,j) + M(-,t[j]), Score(i,j+1) + M(s[i],-), Score(i,j) + M(s[i],t[j]),0}

As a base cases can be established the scores for eliminating prefixes s[1:i] or t[1:j] with i,j=1,...n:

Score(1,1) = 0

Score(i+1,1) = 0 for i=1,...,n

Score(1,j+1) = 0 for j=1,...,m

• The traceback on Smith-Waterman algorithm also differs from that made in Needleman-Wunsch. Once completed the tables Score and decisions, the optimal local alignment score between s and t corresponds to the maximum value of the table Score(i’,j’). Taking this value corresponds to removing the suffix s[i’:n] and t[j’:m].

To reconstruct the decisions taken in the optimal alignment must move on the decisions table as follows:

1. Two pointers are initialized k = i’ y l = j’, and the length of alignment alingment.length = 1

2. Sets taken.decisions[alingment.length] = decisions[k,l]

3. Depending on the value of taken.decisions the pointers are moved upward, left or diagonally across the table.

• If taken.decisions[alingment.length] is equal to 1 then a gap has been added in the first sequence and therefore the pointers are moved up one position, i.e., k = k - 1, l = l.
• If taken.decisions[alingment.length] is equal to 2 then a gap has been added in the second sequence and therefore the pointers are moved one position to the left, i.e., k = k and l = l - 1.
• If taken.decisions[alingment.length] is equal to 3 then a symbol of each sequence has been aligned and therefore the pointers are moved diagonally, i.e., k = k - 1 and l = l - 1.
4. Updates the length of the alignment, alignment.length = alignment.length + 1.

5. If the cell whose value is 0 has been reached, then the algorithm is complete. Otherwise, the current cell will be inspected again from step 2.

This algorithm has been implemented in GetLocalAlignmentData function. Additionally, GetLocalDecisionsTraceback function performs the traceback on Smith-Waterman algorithm, taking as input scores and decisions matrices. Finally, GetLocalAlignmentMatrix function constructs the alignment between two given sequences once executed the Smith-Waterman algorithm:

## Global comparison of genomes: Synteny¶

This section will provide a method of comparing DNA sequences at a higher level to that seen in the previous two sections. Instead of relying on small variations between homologous genes due to substitutions, insertions and deletions will analyze the relative position of genes in complete genomes of different organisms. This type of analysis is part of the comparative genomics, which studies the organization, functions and evolution of whole genomes.

Comparative genomics studies the global transformations that are commonly observed in evolutionarily close species genomes. These transformations involve rearrangements of complete fragments of the genome that may contain hundreds of genes. The most common of these reorganizations are:

1. Inversions: are transformations that turn complete DNA fragments reversing the order of all genes in this fragment.
2. Transpositions: are transformations that occur when complete DNA fragments are cut off from a position and inserted into another.

Up to a point the comparison of complete genomes is reduced to individually compare all genes of the corresponding genomes and integrate such information. To do this, often a similarity matrix is calculated at a genomic scale by determining the number of genes n in the first genome to be compared and the number of genes m in the second genome. Then, a matrix of order n x m is created where each cell i,j contains the percentage of amino acids in common between the gene i from first genome and gene j from the second. This is determined by constructing the optimal global alignment between two sequences using the Needleman-Wunsch algorithm.

The study of the relative order of genes in the chromosomes of evolutionarily close species is called synteny. Typically synteny blocks are searched, long DNA fragments which preserves the order of genes between species, as well as identifying areas where there have been inversions and transpositions.

A first graphical approach for the study of synteny between the genomes of two organisms is to build a dot-plot, where in the horizontal axis the genes of first genome are positioned and on the vertical axis the genes of the second genome, in the order they are found in the corresponding genomes. A point is drawn at position (i,j) where i is a gene homologous to gene j.

In a dot-plot regions of genomes which conserves the relative order of genes are observed as visible segments in the main diagonal, regions where there has been shown as an inversion in the diagonal segments perpendicular to the main and transposed regions are visible as segments parallel to the main diagonal.

The algorithm that calculates the synteny between two genomes has been implemented in GetSyntenyMatrix function

An example of synteny in cyanobacteria

Synechococcus elongatus strains PCC 6301 and PCC 7942 are a good example of synteny between two organisms. The first one, Synechococcus elongatus PCC 6301, has 2523 proteins and the second one, Synechococcus elongatus PCC 7942, has 2612. These two organisms have 2581 homologous genes with a percentage of identical amino acids over 50%, 2482 over 75% and 1636 equal to 100%.

The resulting dot-plot of synteny between this two organisms shows four synteny blocks, none of them is in the main diagonal, that means there are not homologous genes at the same position in both genomes. There are two synteny blocks that show inversions, the first one has about 1430 genes, and it is positioned between positions 94 and 1494, at the first genome and between position 1 and 1461 at the second genome. The second region where an inversion is noted has about 970 genes; it is from position 1495 to 2449 at the first genome, and from position 1633 to 2612 at the second genome. Finally, there are two regions that show transpositions, the first one has about 94 genes and the second one has about 76. The first transposed synteny block is located in the diagonal between positions (1, 1539) and (94, 1633), and the second synteny block can be noted in the diagonal between positions (2448 ,1461) and (2523, 1538).

The next figures show synteny between Synechococcus elongatus strains PCC 6301 and PCC 7942, assuming that homologous genes have a percentage of identical amino acids over 50% (Figure 5.3), over 75% (Figure 5.4) and equal to 100% (Figure 5.5). It can be noted that even when the percentage of identical amino acid is equal to 100% the four synteny blocks are still at the same places.

Figure 5.3: Synteny between Synechococcus elongatus strains - Percentage of identical amino acids over 50%

Figure 5.4: Synteny between Synechococcus elongatus strains - Percentage of identical amino acids over 75%

Figure 5.5: Synteny between Synechococcus elongatus strains - Percentage of identical amino acids equal to 100%

## Algorithms Implementation¶

### GetGlobalAlignmentData¶

```# This function allows to obtain the dynamic matrix, the decision matrix and the alignment
# score through Needleman-Wunsch Algorithm
# Arguments:
#       sequence.1: the first sequence to align
#       sequence.2: the second sequence to align
#       substitution.matrix: the substitution matrix
# Returns:
#       A list that contains the dynamic matrix, the decision matrix and the
#       alignment score
getGlobalAlignmentData <- function(sequence.1, sequence.2, substitution.matrix)
{
# calculates the length of sequences and creates a matrix of order (n+1) x (m+1)
# to store the partial alignment score
n <- length(sequence.1)
m <- length(sequence.2)
dynamic.table <- matrix(nrow=(n+1),ncol=(m+1))

# creates a matrix of order (n+1) x (m+1) to store the decisions made
decision.table <- matrix(nrow=(n+1),ncol=(m+1))
# stores the penalty gap value into a variable and then fill the first row and first
# column of the dynamic table with that value.
gap.penalty <- substitution.matrix[nrow(substitution.matrix),1]
first.row <- seq(from=0,by=gap.penalty,length.out=(m+1))
dynamic.table[1,] <- first.row
first.column <- seq(from=0,by=gap.penalty,length.out=(n+1))
dynamic.table[,1] <- first.column

# sets the initial values of decision table
decision.table[1,1] <- 0
decision.table[1,2:(m+1)] <- 1
decision.table[2:(n+1),1] <- 2

# iterates over sequence.1 and sequences.2 to find out which is the best alignment option
for(i in 1:n)
{
for(j in 1:m)
{
# stores the three possible options
option3 <- (dynamic.table[i+1,j] + gap.penalty)
option2 <- (dynamic.table[i,j+1] + gap.penalty)
option1 <- (dynamic.table[i,j] + substitution.matrix[sequence.1[i],sequence.2[j]])
options <- c(option1, option2, option3)
# stores the best option alignment value and its decision
decision <- which.max(options)
dynamic.table[i+1,j+1] <- options[decision]
decision.table[i+1,j+1] <- decision
}
}
# obtains the score of the best alignment and sets up the result set
score <- dynamic.table[n+1,m+1]
result <- list(score=score,dynamic.table=dynamic.table, decision.table=decision.table)
return(result)
}```

### GetDecisionTraceback¶

```# This function allows to obtain the traceback of the Needleman-Wunsch Algorithm
# Arguments:
#       sequence.1: the first sequence to align
#       sequence.2: the second sequence to align
#       decision.table: the decision matrix
# Returns:
#       A list that contains the alignment length and the decisions made
getDecisionTraceback <- function(sequence.1,sequence.2,decision.table)
{
# initializes the indexes i and j pointing toward the last cell of the decision matrix
i <- nrow(decision.table)
j <- ncol(decision.table)
# initializes a variable to store the length of the alignment and the decision made
alignment.length <- 0
decisions <- numeric()

# iterates over the decision matrix to find out the alignment length and to fill
# the decision vector
while(i != 1 || j != 1)
{
alignment.length <- alignment.length + 1
decisions[alignment.length] <- decision.table[i,j]
if(decision.table[i,j] == 1)
{
i <- i - 1
j <- j - 1
}
else if(decision.table[i,j] == 2)
{
i <- i - 1
}
else if(decision.table[i,j] == 3)
{
j <- j - 1
}
}

# reverses the order of the decision made and then stores the result
decisions <- rev(decisions)
# sets up the result set
result <- list(alignment.length=alignment.length, decisions=decisions)
return(result)
}```

### GetAlignmentMatrix¶

```# This function allows to construct the alignment of two given sequences before executing
# the Needleman-Wunsch Algorithm
# Arguments:
#       sequence.1: the first sequence to align.
#       sequence.2: the second sequence to align.
#       alignment.length: the length of the optimal alignment.
#       decisions: vector containing the decisions made
# Returns:
#       The alignment matrix
getAlignmentMatrix <- function(sequence.1, sequence.2, alignment.length, decisions)
{
# creates the alignment matrix of order 3 x alignment.length, and initializes the indexes
# to iterate over the sequences
alignment <- matrix(nrow=3,ncol=alignment.length)
i <- 1
j <- 1

# iterates over the decisions vector to fill in each row of the alignment matrix
for(k in 1:alignment.length)
{
if(decisions[k] == 1)
{
alignment[1,k] <- sequence.1[i]
if(sequence.1[i] == sequence.2[j])
{
alignment[2,k] <- "|"
}
else
{
alignment[2,k] <- " "
}
alignment[3,k] <- sequence.2[j]
i <- i + 1
j <- j + 1
}
else if(decisions[k] == 2)
{
alignment[1,k] <- sequence.1[i]
alignment[2,k] <- " "
alignment[3,k] <- "-"
i <- i + 1
}
else if(decisions[k] == 3)
{
alignment[1,k] <- "-"
alignment[2,k] <- " "
alignment[3,k] <- sequence.2[j]
j <- j + 1
}
}
return(alignment)
}```

### GetAminoAcidMarkovModel¶

```# This function allows to obtain the Markov model of an amino acid sequence
# Arguments:
#       sequence: an amino acid sequence
# Returns:
#       The Markov model consisting of three elements: the initial probabilities,
#       the transition matrix and the states vector.
getAminoAcidMarkovModel <- function(sequence)
{
# initializes the states vector
states <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
"M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")

# obtains the absolute frequencies for amino acid pairs, then creates the transition
# matrix T of order 20x20
aa.pairs.freq <- count(sequence,2,alphabet = states)
T <- matrix(aa.pairs.freq,nrow=20,ncol=20,byrow=TRUE)

# normalizes the value for each row of the transition matrix
for(i in 1:20)
{
T[i,] <- T[i,] / sum(T[i,])
}

# sets the columns and rows names for the transition matrix
rownames(T) <- states
colnames(T) <- states

# calculates the values of the initial probabilities
initial <- table(sequence) / length(sequence)

# sets up the result sets
result <- list(initial.probabilities=initial,transition.matrix=T,states=states)
return(result)
}```

### GetRandomSequence¶

```# This function allows to generate a random sequence
# Arguments:
#       initial.probabilities: the initial probabilities
#       transition.probabilities: the transition matrix
#       sequence.length: the length of the random sequence to generate
#       sequence.symbols: the symbols to use in the random sequence
# Returns:
#       A random sequence
getRandomSequence <- function(initial.probabilities,transition.probabilities,
sequence.length,sequence.symbols)
{
# creates the vector random.sequence of length sequence.length, and sets the value
# of its first element
random.sequence <- character(sequence.length)
random.sequence[1] <- sequence.symbols[rmultinom(1,1,initial.probabilities)==1]

# calculates the value for the rest of elements in random.sequence
for(i in 2:sequence.length)
{
multinomial.model.row <- transition.probabilities[random.sequence[i-1],]
random.sequence[i] <- sequence.symbols[rmultinom(1,1,multinomial.model.row)==1]
}
return(random.sequence)
}```

### GetAlignmentSignificance¶

```# This function allows to estimate the p-value associated with probability to find a
# random sequence whose score would be higher than the optimal alignment score.
# Arguments:
#       sequence.1: the first sequence to align
#       sequence.2: the second sequence to align
#       substitution.matrix: the substitution matrix
#       number.randomisations: the number of randomizations
# Returns:
#       The p-value and the random scores vector
getAlignmentSignificance <- function(sequence.1, sequence.2, substitution.matrix,
number.randomisations)
{
# calculates the alignment score through the pairwise alignment method
alignment.score <- pairwiseAlignment(c2s(sequence.1), c2s(sequence.2),
substitutionMatrix=substitution.matrix,
type="global", gapOpening=0,scoreOnly=TRUE)
# calculates the Markov model
markov.model <- getAminoAcidMarkovModel(sequence.2)

# initializes the variable to store the number of random alignments having a greater or
# equal score to the optimal alignment and the vector to store the random alignment scores
number.better.scores <- 0
random.scores <- numeric(number.randomisations)

# calculates the number of better scores and the random scores
for(i in 1:number.randomisations)
{
print(paste("Randomisation ", i))
random.sequence <- getRandomSequence(markov.model[["initial.probabilities"]],
markov.model[["transition.matrix"]],
length(sequence.2),unique(sort(sequence.2)))
random.alignment.score <- pairwiseAlignment(c2s(sequence.1), c2s(random.sequence),
substitutionMatrix=substitution.matrix,
type="global", gapOpening=0,scoreOnly=TRUE)
random.scores[i] <- random.alignment.score
if(random.alignment.score >= alignment.score)
{
number.better.scores <- number.better.scores + 1
}
}

# sets up the result set
result <- list(p.value=number.better.scores/number.randomisations,
random.scores=random.scores)
return(result)
}```

### GetLocalAlignmentData¶

```# This function allows to obtain the local alignment through the Smith-Waterman Algorithm
# Arguments:
#       sequence.1: the first sequence to align
#       sequence.2: the second sequence to align
#       substitution.matrix: the substitution matrix
# Returns:
#       A list that contains the dynamic matrix, the decision matrix and the alignment score
getLocalAlignmentData <- function(sequence.1, sequence.2, substitution.matrix)
{
# calculates the length of sequences and creates the matrix of order (n+1) x (m+1)
# to store the partial alignment score and the decisions
n <- length(sequence.1)
m <- length(sequence.2)
dynamic.table <- matrix(nrow=(n+1),ncol=(m+1))
decision.table <- matrix(nrow=(n+1),ncol=(m+1))

# stores the penalty gap value
gap.penalty <- substitution.matrix[nrow(substitution.matrix),1]

# sets the initial values of scores and decisions tables
dynamic.table[1,] <- 0
dynamic.table[,1] <- 0
decision.table[1,1] <- 4
decision.table[1,2:(m+1)] <- 4
decision.table[2:(n+1),1] <- 4

# iterates over sequence.1 and sequences.2 to find out which is the best alignment option
for(i in 1:n)
{
for(j in 1:m)
{
option3 <- (dynamic.table[i+1,j] + gap.penalty)
option2 <- (dynamic.table[i,j+1] + gap.penalty)
option1 <- (dynamic.table[i,j] + substitution.matrix[sequence.1[i],sequence.2[j]])
options <- c(option1, option2, option3,0)
decision <- which.max(options)
dynamic.table[i+1,j+1] <- options[decision]
decision.table[i+1,j+1] <- decision
}
}

# obtains the best score
score <- max(dynamic.table)
result <- list(score=score,dynamic.table=dynamic.table, decision.table=decision.table)
return(result)
}```

### GetLocalDecisionsTraceback¶

```# This function allows to obtain the traceback of the Smith-Waterman Algorithm
# Arguments:
#       sequence.1: the first sequence to align
#       sequence.2: the second sequence to align
#       dynamic.table: the scores table
#       decision.table: the decision matrix
# Returns:
#       A list that contains the alignment length and the decisions made
getLocalDecisionsTraceback <- function(sequence.1,sequence.2,dynamic.table,decision.table)
{
# initializes variables and indexes
indices <- which(dynamic.table==max(dynamic.table),arr.ind=TRUE)
i <- indices[[1]]
j <- indices[[2]]
alignment.length <- 0
decisions <- numeric()

# iterates over the decision matrix to find out the alignment length and to fill
# the decision vector
while(decision.table[i,j] != 4)
{
alignment.length <- alignment.length + 1
decisions[alignment.length] <- decision.table[i,j]
if(decision.table[i,j] == 1)
{
i <- i - 1
j <- j - 1
}
else if(decision.table[i,j] == 2)
{
i <- i - 1
}
else if(decision.table[i,j] == 3)
{
j <- j - 1
}
}

# reverses the order of the decision made and then stores the result
decisions <- rev(decisions)
# sets up the result set
result <- list(alignment.length=alignment.length, decisions=decisions,ini.1=i,ini.2=j)
return(result)
}```

### GetLocalAlignmentMatrix¶

```# This function allows to construct the local alignment of two given sequences before
# executing the Smith-Waterman Algorithm
# Arguments:
#       sequence.1: the first sequence to align
#       sequence.2: the second sequence to align
#       alignment.length: the length of the optimal alignment
#       decisions: vector containing the decisions made
#       ini.1: the position of index i
#       ini.2: the position of index j
# Returns:
#       The alignment matrix
getLocalAlignmentMatrix <- function(sequence.1, sequence.2, alignment.length,
decisions, ini.1, ini.2)
{
# creates the alignment matrix of order 3 x alignment.length, and initializes the
# indexes to iterate over the sequences
alignment <- matrix(nrow=3,ncol=alignment.length)
i <- ini.1
j <- ini.2

# iterates over the decisions vector to fill in each row of the alignment matrix
for(k in 1:alignment.length)
{
if(decisions[k] == 1)
{
alignment[1,k] <- sequence.1[i]
if(sequence.1[i] == sequence.2[j])
{
alignment[2,k] <- "|"
}
else
{
alignment[2,k] <- " "
}
alignment[3,k] <- sequence.2[j]
i <- i + 1
j <- j + 1
}
else if(decisions[k] == 2)
{
alignment[1,k] <- sequence.1[i]
alignment[2,k] <- " "
alignment[3,k] <- "-"
i <- i + 1
}
else if(decisions[k] == 3)
{
alignment[1,k] <- "-"
alignment[2,k] <- " "
alignment[3,k] <- sequence.2[j]
j <- j + 1
}
}

return(alignment)
}```

### GetProteinIdentity¶

```# This function allows to calculate the percentage of identical amino acids
# between two sequences already aligned.
# Arguments:
#     sequence.1: the first sequence
#     sequence.2: the second sequence
# Returns:
#     The percentage of identical amino acids
getProteinIdentity <- function(sequence.1, sequence.2)
{
return((sum(s2c(sequence.1) == s2c(sequence.2))/length(s2c(sequence.1))) * 100)
}```

### GetSyntenyMatrix¶

```# This function allows to compare two genomes to find their homologous genes
# Arguments:
#     gene.list.1: the first amino acid sequence of genes
#     gene.list.2: the second amino acid sequence of genes
#     substitution.matrix: the substitution matrix
# Returns:
#             The similarity matrix
getSyntenyMatrix <- function(gene.list.1,gene.list.2,substitution.matrix)
{
# creates the similarity matrix of order n x m
n <- length(gene.list.1)
m <- length(gene.list.2)
genome.similarity <- matrix(nrow=n,ncol=m)

# iterates over the first genome to obtain gene sequences in gene.list.1
for(i in 1:n)
{
print(paste("Comparison ",i))
# iterates over the second genome to obtain gene sequences in gene.list.2
for(j in 1:m)
{
# calculates the optimal alignment score through the pairwise Alignment method
alignment <- pairwiseAlignment(c2s(gene.list.1[[i]]),c2s(gene.list.2[[j]]),
substitutionMatrix=substitution.matrix,
type="global",gapOpening=0)

# stores the alignment score in the similarity matrix
genome.similarity[i,j] <- getProteinIdentity(as.character(pattern(alignment)),
as.character(subject(alignment)))
}
}

return(genome.similarity)
}```

Footnotes

 [1] UniProt Knowledgebase - Primary accession number: Q7V6F8, http://www.uniprot.org/uniprot/Q7V6F8
 [2] UniProt Knowledgebase - Primary accession number: P00877, http://www.uniprot.org/uniprot/P00877