Tuesday, October 16, 2007

Symmetry between dimensionality reduction and data clustering

Dimensionality reduction and data clustering are two main types of services offered by VisuMap to support visual exploration of high dimensional data. Although there was no explicit plan in the initial design of VisuMap to develop these two types of services in parallel, they have evolved nicely together in the software architecture.

We might argue that the reason for the nice coexistence are because they are both indispensable to explore high dimensional no-linear data. But there is another more profound reason for this: namely, they are conceptually symmetry to each other.

In order to see this "symmetry" let us consider a high dimensional dataset as a data table with rows and columns. The number of columns is normally also considered as the dimension of the data. A dimensionality reduction method basically tries to reduce the number of columns without losing much relevant information. On the other side, a clustering algorithm tries to group similar rows together so that the clusters preserve as much as possible information. In other words, the purpose of clustering algorithm is to reduce the number rows by replacing them with clusters.

In terms of "symmetry" we can say, dimensionality reduction algorithms reduce the number of columns whereas clustering algorithms reduce the number rows.

So, what does this symmetry brings us? One application of this symmetry is that we can transfer any clustering algorithm to a dimensionality reduction method (and vice versa) by transposing the data table as a matrix. For instance, we can apply a clustering method on the columns of dataset to get three clusters and use the centroids of the clusters as 3D coordinates for the rows. By doing so we reduce the dataset's dimension to 3.

No comments: