Monday, February 6, 2017

On the shape of nucleotide motifs

Genomes of multi-cellular organism often contain short sequences with many duplicates which have been created and preserved with minor mutations during their evolutionary processes. Those nucleotide sequences, called here generically motifs, carry important information about the structure and history of those genomes. Better understanding of those motifs could help us to explore the function and evolutionary path of genetic code.

In this note I'll put forward a method to visualize motifs as 3D dotted maps, where each motif is represented by a dot in the 3D space; and sub-groups of motifs may take different shapes, which I'll try to interpret in terms of evolutionary variations.

I have put forward some methods to visualize DNA sequence a while ago. In those early attempts, different scanning machines were used to extract high dimensional vectors from DNA sequences; which are then mapped into 3D space with multidimensional scaling algorithm. A main problem of those methods is that they are often not very sensitive towards sequential variations, so that the resulting maps lack capability to reveal structural information. The method presented here basically follows the same framework, except that we'll not use scanning machine, but feed the motif sequences as data points directly into the multidimensional scaling algorithm which uses the neadleman-wunsch algorithm to calculate the editing distance between the motifs.

In the following I'll use the chromosome I of Caenorhabditis Elegans as an example to detail the steps of the visualization process.

Firstly, I downloaded the sequence of chromosome I of C. Elegans from the NCBI site (NC_003279) I then created an blast database for the chromosome with the makeblastdb program;  then I used the blastn program to query on all sub-sequences of length 30 nucleotide bases; A sub-sequences will be recorded as motif, if it has more than 30 duplicates on either strand of the chromosome, where a fuzzy matching with one or two mutation or deletion/insertion are employed. In this way 4578 motifs have been found among the chromosome I that has length of about 15 million nucleotide bases. The following picture shows the distribution of these motifs on the chromosome:


In above picture nucleoside types A, C, G, T are shown by color yellow, magenta, red and cyan respectively; The locations of motifs are marked by blue bars. We notice that many motifs overlap each other. Those overlapping motifs are merged together to form the 4578 disjunctive sequences of different length as be shown in following table:



I then apply the t-SNE algorithm on these 4578 motifs to create a 3D map. I used needleman-wunsch algorithm to calculate the minimal editing distance between the motif sequences. The needleman-wunsch algorithm uses a simple penalty schema of (2, 0, 1) for gape, match and mismatch respectively. The following picture shows the resulting 3D dotted map:



Then a clustering algorithm (a combination of k-mean algorithm and mean-shifting algorithm) is applied to the 3D map that results in coloring of major clusters with different colors. With some minor manual color adjustment, we obtained a colored 3D dotted map as shown in the following animation:



From the above map we can roughly find clusters of the following shapes:


Assuming that all motifs are created by duplication and random single nucleotide mutations, we could interpret above shapes as following: Motifs form shape of small blobs indicate those motifs originated from a common ancestor and have undergo minor mutations. Those motifs could be either recent duplicates, so that they don't have time to accumulate large number of mutations. Or, if they exist for long time, their mutation with large number of mutation did not survive the evolutionary selection process.

Motifs form the shape of curves indicate that those motifs, originating from a common ancestor, have mutated gradually away from its initial copy. With help of VisuMap software we can locate those motifs in the chromosome and trace their evolutionary process along the curve ( but we cannot say which end of the curve is the starting end.) The following pictures show a case with a curved motif collection:

In above pictures, the top picture shows a section of the 3D motif map; the second picture shows corresponding locations of those motifs on the chromosome. With VisuMap you can interactively move the selection of motifs in the window displaying the first picture, so that the second picture will mark the location of selected motifs. The last picture shows the nucleotide sequences of these motifs, one motif per row, in the order along the curve from left to right. We can see that these motifs gradually shifted their phases.

We notice some curved motif groups have small blobs attached to them. These blobs might indicate short bust of different mutations.

Interestingly, we see two motif clusters in the 3D map which form roughly two parallel rings. Those motifs are distributed across the whole chromosome, they may have some internal relationship, but I'm not sure in moment how to interpret them in terms evolutionary process.

In addition to C. Elegans, I also applied the visualization algorithm to other organisms. The following short video clips shows several motif maps of the rice and human genome:


It should be noted that for the human chromosomes, calculating the motif map with the current implementation for all motifs would be computationally prohibitive. So, I used a simplified version of blastn; and selected three small sub-sets, each with about 20 000 motifs, for above motif maps. The three sub-sets have increasing duplicate frequencies. So, for instance, motifs of the third set all have over 10 000 duplicates; whereas the first set has motifs with 300 to 500 duplicates.

From above maps we can see that the motifs of rice genome are much less structured: almost all of them are just small blobs. Those motifs are, unlike the case with C. Elegans, mostly located in coding area. On the other side, the motif map of the human genome shows diverse structures depending on the duplicate frequencies; and the third motif map evens shows some kind of 2-dimensional structure.

Discussion

The visualization method presented here provides a simple way to extract geometrical models from pure discrete genetic sequence information. These geometrical shape provide intuitive information about the evolutionary path of genetic code; and they might be helpful to profile and explore whole genome structures.

The method presented here has three key components: the way to find motifs; the way to calculate editing distance and the multidimensional scaling algorithm (MDS). Each of these three methods has significant impact on the results and they also pose significant limitations. For instance, the current implementation of these components are quite slow, so that it is difficult to visualize large motif sets. I have experimented with other editing distance, like the smith-waterman and levensthein method, they seem to perform similarly, needleman-wunsch just produces slightly better visualization. For the future investigation we might work on finding more efficient MDS algorithm or finding more efficient distance metric to help us to attack much larger and different groups of genetic code.



Tuesday, June 21, 2016

VisuMap version 4.7 with Whole Genome Browser Released.

I have just released VisuMap version 4.7 on our web site. Apart from many GPU related optimizations of 3D data visualization, this release includes a heat map style data view to visualize long discrete sequences. A whole genome with billions of nucleotides can be aggregated and displayed in a single 2D map. The following screen snapshot shows, for example, the exome distribution of the first human chromosome:



We notice that most genome browsers current available are depth oriented: their exploration are normally focused on relatively small sections of the genome. The whole genome browser in VisuMap offers a new way to help people to explore genome wide patterns and profiling. 

Wednesday, December 9, 2015

Visualizing Gene Sequences with complementary GMS

We now know that there are about 20 000 protein coding genes in the whole human genome. How do they relate to each other?  Is there any geometrical structure among them? What  shape do they form?  Aiming at these questions,  I'll put forward here a method to generate scatter plot from gene sequences that shows the shape of a gene collection.

In recently blogs, I have described GMS framework to profile gene sequences which translates discrete sequence to high dimensional curvature vectors. We could use this method to convert a collection of gene sequences to a set of high dimensional vectors and then apply MDS algorithm to embed them in a low dimensional space. However, if we restrict our focus on protein coding sequence (CDS), the length of CDS varies from few hundreds to tens of thousands nucleotides. That means, the profiling curvature vectors for CDS sequences are of very different dimensions. This in general causes difficulties for most MDS algorithms as they normally accept only vectors of a common dimension.

In order to obtain profiling vectors of the same length from sequences of variable length, we extended our GMS framework as follows: we first select a short reference sequence; Then for each CDS sequence, we mix the reference sequence with the CDS sequence and pass them together to the GMS framework. After the GMS framework embedded the two sequence into the low dimensional space, we then calculate the curvature profile of the curve of the reference sequence, and use it as a profiling vector for the CDS sequence. The modified framework is depicted as follows:

We notice that the reference sequence and the CDS sequence are embedded together in the same space, so that their curves occupy complementary area of the low dimensional space. That means the curvature profile of the reference sequence carries information about the complementary space of the CDS sequence. More generally, we can expect that different CDS sequences will produce different complementary profile as they have different impact on the reference sequence.

For this note I have downloaded the CDS sequences of the first chromosome of the human genome from the Ensembl.org web site; There are about 2000 protein coding genes whose length very from few hundreds to more than 26 thousands nucleotides. As reference sequence, I picked a short CDS sequence of CDS collection (the gene ENSG00000187170) that has only 300 nucleotides. The following picture shows the complementary curvature profile the CDS sequences: each row of the heatmap shows the curvature of the reference sequence mixed with one CDS sequence; bright colors indicate high curvatures. The line chart under the heatmap shows the complementary profile of 4 selected CDS sequences marked in the heamap


We see that these profiles at large have high curvatures at about half dozens regions, but the height and exact location of those high curvature peaks depend on the corresponding CDS sequence. It is important to notice that I have in the previous blog demonstrated that the curvature profile is a characteristics that is invariant from the random nature of the mapping algorithm. Thus, those small variation are not due to the random nature of the mapping algorithm used, they ought to be attributed to different interaction (or "affinity") between the CDS sequence and the reference sequence.

Having obtained one complementary curvature vector for each CDS sequence, we can then apply a MDS algorithm on these high dimensional vectors to embed them in low dimensional space. For this purpose, I have used the tSNE algorithm. The following picture shows a 3D map for these CDS sequences:
Curvature Profile Map of 2045 CDS sequences created with tSNE algorithm


The above map is linked to a short video, you can click on the map to see the 3D map in animation. Also, for the purpose of easy exploration I have added different colors to the different regions on the map with the k-Mean clustering algorithm.

It should be noticed that the selection of the reference sequence has essential impact on the curvature profile and the final map. To see this, I have picked another sequence, the CDS sequence of gene with id ENSG00000158481 (with about 1000 nucleotides,) as reference sequence and repeated the whole procedure. The following picture shows the corresponding complementary curvature profile and the tSNE map of the 2045 CDS sequences:

Curvature Profile and tSNE map of the 2045 CDS sequences with ENSG00000158481 as reference sequence
We see that both the curvature profile and the final map are very different from those created previously with a different reference sequence.

It is not yet clear how the reference sequence impacts the complementary curvatures. Intuitively, the reference sequence plays a role like the a projection plane that we often use to project high dimensional data to 2-dimensional space for visualization purpose. Thus, selecting an appropriate reference sequence might help us to see interesting geometrical structure among large collection of gene sequences.

About the sample data

The sample genome data is downloaded from Ensembl.org web site in fasta format. The fasta data file is imported into VisuMap with a special plugin. The zipped file here contains the VisuMap sample file, the plugin for importing fasta files and a script "CurvatureProfiling.js" to run the simulations. Before running the simulation, the plugin module GeneticAnalysis.dll need to be installed into VisuMap.







Saturday, September 19, 2015

Curvature as secondary profiling of GMS maps.

As a profiling method, GMS translates discrete sequences to space curves which help us to explore sequence features (treats) with help of geometrical shapes. Yet, space curves still pose a challenge for direct pattern recognition. This is especially so, when we try to explore patterns among large collection of  sequences.

In this note I'll put forward a method to use curvature to capture characteristic of GMS space curves. Some experiments with sample data will show that curvature profile are suitable for pattern analysis of large collection of GMS space curves.

Curvature of  space curves

Curvature is a concept introduced in mathematics to measure how far a segment of a geometrical shape deviates from being flat. Intuitively for space curves, the curvature of a curve at a point measures how strong the curve is bent at the point.

More formally, Let P1, P2, ..., be a series of points in the 3 dimensional space representing a space curve, then for the purpose of this note, we use the following simple formula to estimate the curvature, κk, at point Pk:
As shown in above diagram, α is the angle between the two segments PkPk+1 and Pk+1Pk+2; s is the length of the segment PkPk+2. If Pk are the 3-dimensional coordinate vectors of the points, then the above formula becomes:
Thus, the curvature  profile of space curve is a series real number values, κk, that measure how strong the curve is bent at each point. In practice, when we calculate curvature profile of a GMS curve, we don't need to calculate the curvature at each point, but just at points sparsely and evenly sampled from the the GMS curves. For the sake of simplicity, we simply select one point from for each input sequence node. That means, for a sequence with n nodes, we calculate a n-dimensional curvature profile. The following diagram shows the whole work-flow to calculate the curvature profile from a discrete sequence:



Above work-flow is mainly an extension of the one in a previous note, where more detailed explanation is provided. The only modification to the work-flow is an extra step that calculates  curvature profile from the GMS curve of the first clone. As will be discussed latter in this note, it has been turned out in the simulation experiments that the clones normally have about the same curvature profile.

Characteristics of curvature profile

Curvature profile has several properties which make it suitable to characterize GMS space curves. First, just from its definition we know that the curvature profile is rotation and shifting invariant, so that rotation and shifting information are automatically removed from curvature profile. Because of this running GMS algorithm multiple times on a sequence will results in the same curvature profile, even when the corresponding space curve are rotated and shifted differently due to some random nature within the GMS algorithm.

Going one step further we can change the number of clones in the GMS algorithm. The resulting spaces curves for the clones normally vary systematically in shape and size; Yet their corresponding curvature profiles are more or less similar to each other. The following diagram shows the curvature profiles of  5 clones of sequence with 5 fold cloning:

We notice in above picture that the curvature profiles for the middle or inner clones have noticeably larger peaks than those of the outer clones; and there is symmetry alone the middle (the third) profile. These curvature profiles shows that all 5 space curves bend more or less at common positions and the inner curves bend slightly more than those outer curves. On the right side of above picture are the 5 curvature profiles displayed as a heatmap. As will be seen below, heatmap offers a better way to visualize large collection of curvature profiles.

Another interesting property of the curvature profile is that they are in general independent on the sampling frequency of the GMS algorithm. High frequency sampling normally results in smoother space curves, but their shape remain mostly unchanged. The following picture shows the curvature profile of another sequence with sampling frequency of 6, 8, 10, 12 and 14 scans-per-node. The 5 curvature profile are almost identical.



Profiling Transmutation

In addition to the invariant characteristics, the curvature profiles of similar sequences are in general similar. This property enables us to study systematical variation among large collections of sequences by visualizing their corresponding curvature profiles.  We demonstrate this method here with an example that simulates a transmutation process. As the be shown in the following diagram, transmutation is the process that one sequence gradually mutates to another sequence.

For the sake of simplicity we assume that both sequences, A and B, have the same length N. A simple transmutation process is that an increasingly longer sub-sequence A replaces a corresponding sub-section in B. More particularly, the simple transmutation is realized by the sequence Tk(A, B) := A[1, k]*B[k+1, N]; where k=1, 2, ..., N; A[1,k] is the sequence of the first k nodes of A; B[K+1, N] is the sequence of the last N-k nodes of B; * is the concatenation operation.

With help of GMS algortihm we can calculate the space curves of Tk; k=1,2,..., N; and then calculate their curvature profiles as N-dimensional vectors. For this example we have used two sequences of 1377 nodes (one of them is the CCDS of CD4); The following picture shows their 1377 curvature profiles displayed as an single heat map (the three curve charts on the right side are profiles of the transmutation sequence at 3 particular times) :

In above heat map we can notice a variation region that runs from top left corner to the right bottom corner, this feature indicates that during the transmutation process, only the part of the space curve that is close to the mutated node undergoes significant change. In other words, features of space curve can be traced back to individual nodes in the transmutation sequence.

As the curvature profiles are vectors of the same dimension we can use a MDS (multidimensional scaling algorithm) algorithm to map them to a low dimensional space, so that we can see their sharp. The following pictures is a 2-dimensional map generated with the t-SNE algorithm for those curvature profiles in above example:


In above map, each dot represents the curvature profile of a transmutation sequence. As a property of MDS maps,  closely located dots normally represent transmutations with smaller effects on curvatures. Thus, this map can help us to locate sequence regions which cause large change in space curve.

Discussion 

Curvature profile purposed in this note is a secondary characteristics of discrete sequences, it can help us to study patterns and structures among large collection of discrete sequences. Whereas GMS algorithm provides a mean to geometricalize individual sequences; curvature profile offers a tool to geometricalize a set of sequences as a whole. Various invariant properties and visualization experiments discussed in this note seem to validate curvature profile as suitable characteristics for discrete sequences.

During the searching for such characteristics I have experimented with several other quantities, like quantities derived from the rotation speed, torsion and velocities. Curvature profile as purposed above seems to have the best properties among all the tested quantities. It should also be pointed out that the curve curvature purposed in this note differs slightly from the standard defintion. The reason for this discrepancy is that the standard mathematical definition is unstable when the sampling points are located in tiny regions with noise. For evenly distributed curve sampling points, the purposed curvature formula approximates the standard definition pretty well.

In general, curvature profile together with certain MDS algorithm could provide an interesting tool to explore similarities and patterns among sequential structures and shapes encountered in microbiology.