Tuesday, September 23, 2014

On geometric modeling of sequential data

The biological cellular system appears to be a universal machine that is capable to translate discrete sequential data, i.e. genetic code, to all kinds of organisms with diverse phonetic features. In this note I am going to put forward a computational framework that translates discrete sequential data to lower dimensional geometrical shapes. Inspired by biological cellular system, this framework first samples high dimensional vectors from the discrete data; then it applies dimensionality reduction methods to embed those vectors into low dimensional space. The resulting map normally forms shapes with geometrical properties that reflect information in initial sequence.

Let's start with an input sequence of nucleic acids sk;  k=0, 1, 2, ..., K  with sk  ∈ S:={C, G, A, T}. To illustrate the sampling process, we image that the sequence is a series of marbles of 4 different imaginary colors. Mathematically,  these "colors", i.e. the set S, can be represented by 4 dimensional vectors each with just a single non-zero component as follows:

Now, image we have a camera that moves slowly from one end to the other end of the sequence; and during the movement we quickly take a series of pictures of the two closest marbles. These pictures record different colors and intensities of the two closest marbles. As depicted in the following schema, we code those "pictures" as 8 dimensional vectors:

In above schema, α and β are the intensities coefficients for the two closest marbles (they can be multiplied directly into the first half and second of the 8-vectors correspondingly). α and β are calculated as follows: We assume that the camera moves evenly one marble per time unit. So at given time, say t1,  the camera is located between k-th and k+1-th marble; then the phase of the camera is calculated as:   p1 :=  t1 - k ;  where k the largest integer smaller than  t1, i.e. k=⌊t1⌋. Then,  α and β are calculated as: α := cos(p1π/2);  β:=sin(p1π/2).  We notice here that when the camera moves smoothly from k to k+1-ε  for any infinitesimal small number ε,  the pair (α, β)  changes smooth from (1, 0) to (0, 1).

However, at the time point k+1, the pair (α, β) takes the value (1, 0), since the camera is now considered between the k+1-th and k+2-th marble.  That means, (α, β) as a time dependent 2-dimensional variable is not continues at the integer time points; Consequently, the corresponding  8-vectors in general won't change smoothly in the 8-dimensional space.

In order to avoid this discontinuity, we swap the two 4-vectors of a 8-vectors for those  vectors produces when ⌊t1⌋ is an odd integer. That means in above schema, at the time  t2, the 8-vector produced is swapped from v2 to v'2 as shown in the following diagram:
With this swapping operation, we can easily verify that the 8-dimensinal vectors sequence actually changes smoothly as the camera smoothly processes.

Additionally to these 8 components, the sampling process at each step also adds a timestamp ct as the 9-th components to the output vectors, where c is a small constant chosen as an algorithm parameter.

Summing-up above steps, the sampling process can be denoted as an operator that operates on discrete sequences s as following:
where t ∈[0, K] is the the time parameter; c is an algorithmic parameter; k:= ⌊t⌋ is the sequential index; α := cos(pπ/2) and β:=sin(pπ/2) with p:= t- k as intensity coefficients for the two closest marbles.

After we have sampled a set of 9-dimensional vectors, we then use a dimensionality reduction (DR) algorithm to embed those vectors into 2- or 3-dimensional space. For our framework, we need a DR algorithm that is strong in preserving local information, since the "camera" basically samples only local information. Algorithms like RPM, CCAAffinity Embedding and t-SNE appear to be good candidates for this purpose. For this note I picked the t-SNE algorithm implemented in VisuMap software. 

As first example we use our algorithm to model a short nucleic acid sequence "CCCTGTGGAGCCACACCCTAG". Notice that the timestamp coefficient in  above description is the only component that grows with time unlimitedly; whereas the other 8 components are confined to the range between 0 and 1.0.  Thus, larger c will likely stretch the model further apart. In order to show this property, we created four different data sets with c=0, 0.5/N, 2.5/N, and 10/N;  where N is the number of samplings, which is in this note 15000. The following four clips show the resulting geometric models of these data sets. For the sack of clarity, I have colored the data points with increasing brightness as time progresses; and when the scanning camera is at closest to a node, the corresponding sampling data point will be represented with an icon that indicates the type of the nucleic acid.

Above models showed clearly that the coefficient c  controls how far the model get stretched. The other 8 components contributes information to the folding structure.

The second example compares models created by reversed sequence. The following clip compares the model of previous sequence with the model of its reversed sequence. We can see that the two models are geometrically more or less unchanged, except that the gradient direction is reversed. This means that the described modeling method is invariant under the sequence inversion.

The third example shows how the model changes when the sequence is duplicated multiple times. The following video clip shows the model of our sample sequence and the model of 5-time duplicates of that sequences. We can see that model of the 5-time duplicates resembles 5 time overlaps of the single sequence model.

As next example, we created a model for the protein sequence MELNSSFWTLIKTKMKSRNDNNKLLDTWLDPIEYVSTTGSADRP. The slightly more complicated model is shown in the following video clip:

We notice that the sampled vectors are all sparse vectors. In practical implementations, we can directly calculate the Euclidean distance between those vectors, without explicitly calculating those high dimensional vectors. In this way, we can model sequences of any base set. As final example
I have created a 3D model for the text sequence "One fish two fish red fish blue fish" as show in the following video clip:


The framework described here demonstrates a new way to derive geometrical models from discrete data. The framework is abstract in the sense that it is independent of the physical and chemical properties of the sequence.The experiments demonstrates that high dimensional feature space might be an effective way, as an intermediate stage, to derive 3 dimensional phonetic structure.

We notice that the models created above are all basically 1-dimensional curves folded in different ways. For the future study, it would be interesting to develop 2-dimensional sampling method that produces 2- or 3-dimensional models.  More speculatively for future study, can we develop an evolutionary process to find a sequence that gives rise to models with certain properties?

Monday, September 15, 2014

On affinity embedding

We have just released VisuMap version 4.2.902. A new mapping algorithm called Affinity Embedding (AE) has been introduced in this release. I'll briefly describe AE as a multidimensional scaling method(MDS);  or, in more recent terminology, as a dimensionality reduction method.

Like many MDS methods, AE tries to create a low dimensional representation of high dimensional data points by minimizing a stress function. AE's stress function is as following:
where aij is a positive number representing the affinity between data points i and j; dij is the Euclidean distance between that two points in the low dimensional space. Affinity as a generic concept quantifies certain kind of similarity between data entities. We notice that most MDS methods try to minimize difference (e.g. squared error) between a dissimilarity matrix and dij; whereas AE tries to approximate the similarity matrix.

Since many data sets from practices have been characterized by a distance matrix δij, VisuMap uses the following exponential affinity function to convert a distance matrix to an affinity matrix:
where cij are constants estimated based on densities around data points in the high dimensional space. This affinity function is just one simple way to convert dissimilarity to similarity; in fact any monotonous decreasing function may be used as affinity function. For more general applications, VisuMap offers a plugin interface to allow users to implement custom affinities.

AE uses a simple version of gradient descent method to minimize the stress function. The computational complexity for each iterative step is O(kN) where N is the number of data points and k is constant that is typically around value 100.  Moreover, the memory complexity of AE for each step is O(N). Thus, AE can be applied on relative large data sets, i.e. with a million and more data points.

The main characteristics of AE is its focus on local information (i.e. data with high affinity or low dissimilarity). The reason for this is that lower affinity are represented by close-to zero values and the optimization algorithm can more effectively ignore zero values; whereas other algorithms like CCA or Sammon map need to reduce the effect of large values (e.g. large distance or large dissimilarities.) which is normally harder to do.

AE has been largely inspired by the algorithm Stochastic Neighbor Embedding(SNE) and the more recent version t-SNE. Both SNE algorithms work pretty well in focusing on local information and both of them use a kind of exponential affinity functions in their stress function. However, because of their  probabilistic nature, they require in general quadratic computational and memory complexity, which makes it difficult to attack large data sets.

Like SNE, AE works pretty well in unfolding closed manifolds. The following pictures compare AE and t-SNE in mapping three spheres of different sizes in the 3D space to 2D plane. Notice that AE somehow preserves the size the three sphers; whereas t-SNE basically ignores the size information:

Data set forming three spheres in 3D space.

The t-SNE map for the three sphere data set.

The AE map for the three sphere data set.

Wednesday, January 15, 2014

VisuMap 4.2.896 Released

I have just released our software package VisuMap version 4.2.896.  In this release we have upgraded the graphics library DirectX from version 9 to version 11.  Due to this upgrade some of 3D animation data views have improved performance by more than 10 folds. VisuMap is now capable to interactively explore more than one million data points in 3D view on a normal desktop computer.

Also with this upgrade, VisuMap will require OS system Windows 7 or higher. Windows XP will be continually supported, but without new features which take advantage of DirectX 11 library.  In the future, we'll likly implement more performance enhancements by tackling the computation power of GPU.

Monday, November 4, 2013

On Multidimensional Sorting

In a previous blog post I have talked about a method to convert photos to high dimensional datasets for analysis with MDS methods. This post will address the opposite problem: given a high dimensional dataset, can we convert it to a photo alike 2D image for easy feature recognition by human?

Let's consider our sample dataset S&P 500. A direct approach to create a 2D map is to simply convert the matrix of real values to a matrix of gray-scale pixels. This can be easily done in VisuMap with the heatmap view and the gray-scale spectrum. The following picture shows the S&P 500 dataset in normal line diagram view on the left side and the corresponding heatmap on the right side:

Each curve in the left picture corresponds to the price history of a S&P-500 component stock in a year. We notice heavy overlaps among the curves. On the right side, the heatmap represents the stock prices with row strips with different brightness. In the heatmap, we can spot roughly the two systematical downturns at the two slightly darkened vertical strips. There is however no discernible pattern between the rows, as the rows are just ordered by their stock ticks, so that neighboring relationship don't indicate any relationships between their price history.

One simple way to improve the heatmap is to group stocks with similar price history together. This is the task of most clustering algorithms. We can do this pretty easily in VisuMap. The following heatmap shows the same dataset after we have applied k-Mean clustering algorithm on the rows and re-shoveled the rows according to cluster assignments:

The k-Mean algorithm has grouped the rows into about 8 clusters, we can see that rows within each group represent rows with, more or less, similar values. However, the clusters are randomly ordered, and as well as the rows within a cluster. The clustering algorithm does not provide ordering information for individual data points.

Can we do better job than the clustering algorithm? To answer this question, let's recall what we actually tried to do above: we want to reorder the rows of the heatmap, so that neighboring rows are similar to each other. This kind task is basically also the task of MDS (Multdimensional Scaling) algorithms, which aim to map high dimensional data to low dimensional spaces while preserving similarity. Particularly in our case, the low dimensional space is just the one-dimensional space, whereas MDS in general have been used to create 2 or 3 dimensional maps.

Thus, to improve our heatmap,  we apply a MDS algorithm to our dataset to map it to an one-dimensional space, then use their coordinates in that one-dimensional space to re-order the rows in the heatmap. For this test, I have adapted the RPM mapping (Relational Perspective Map) algorithm to reduce the dimensionality of the dataset from 50 to 1, then used the one-dimensional RPM map to order the rows in the heatmap. The following picture shows a result heatmap:

We notice this heatmap is much smoother than the previous two.  In fact, we can see that the data rows gradually change from bright to dark color, as they go from the top to the bottom. More importantly, we don't see clear cluster structure among the rows, this is in clear contrary to the k-Mean algorithm that indicates 8 or 10 clusters. In this case, it is clear that k-Mean algorithm delivered the wrong artificial cluster information.

Now, a few words about the implementation. The multidimensional sorting has been implemented with RPM algorithm, since RPM allows gradually shrinking the size of the map by starting with 2D or 3D map. This special feature enables RPM to search for optimal 1D map configuration among a large solution space. We could easily adapt other MDS algorithms to perform multidimensional sorting, but we probably will have to restrict solely on one-dimensional space with those algorithms. A few years ago, I have already blogged about this kind of dimensionality reduction by shrinking dimension with some simple illustrations.

Since high dimensional datasets are generally not sequentially ordered, there are in general not an unique way to order the data. What we want to do is just find a solution that preserves as much as possible similarity information with the order information. Thus, an important question arises here: How do we validate the sorting results? How do we compare two such sorting algorithms?  A simple way to validate the sorting result is illustrated in the following picture:
So, in order to test the sorting algorithm we first take simple photo and convert it to high dimensional dataset that represents the gray-scale levels of the photo. We then randomize the order of the rows in the data table. Then, we apply the sorting algorithm on the randomized data table, and check how good the original photo can be recovered. I have done this test for some normal photos, the multidimensonal sorting was able to recover pretty well the original photo (see also the short attached video), as long as enough time is given to the learning process.

We have described an application of MDS algorithm for heatmap by using it to reduce data dimension to a single dimension. We might go a step further to use MDS to reduce data dimension to 0, so that data points will be mapped to a finite discrete points set. In this way, we would have archived the service of clustering algorithms. If this works, clustering algorithms can be considered a special case of MDS algorithms; and MDS algorithms might lead us to group of new clustering algorithms.

The multidimsional sorting service has been implemented in VisuMap version 4.0.895 as integrated service. The following short video shows how to use this service for the sample datasets mentioned above.

A more practical application of MDS sorting is for microarray data analysis where heatmaps are often used to visualize the expression levels of a collection of gens for a collection of samples. Normally, neither the gens nor  the samples are ordered according to their expression similarity, so that those heatmaps often appear rather random (even after applying additional grouping with hierarchical clustering algorithm.) The following picture shows, for instance, a heatmap for expressions of 12000 genes for about 200 samples:

After applying MDS sorting both on the rows and columns of above heatmap, the heatmap looks  as following:

The above heatmap contains theoretically the same expression information. But, just by re-ordering the rows and columns, it shows more discernible structure and information for human perception.