Thursday, September 4, 2008

On similarity metrics for chemical compounds

Recently, Yap Chun Wei posted a dataset on the pharmine blog. The dataset consists of fingerprints of 171 pharmacologically interesting compounds. To recap, the fingerprint of a compound is here a vector of 1025 binary flags, where each flag indicates whether a particular molecular fragment is present in the compound. There are many ways to calculate fingerprints. Depending on the nature of the problem, you can use different algorithms or different collections of fragments. The fingerprints in this dataset, for instance, were calculated with OpenBabel.

The dataset contains 3 different groups of compounds: penicillins, cephalosporins, and fluoroquinolones. Using VisuMap, Yap created different maps of these 171 compounds which showed, more or less, the cluster structure of the dataset. I personally find that the PCA map provides the best visualization. The following picture is a PCA map I created with VisuMap for this dataset:

Compound Map of 171 compounds

In the above map, the 3 compound groups are displayed as glyphs in 3 different colors. The coloring was done manually with VisuMap based on known information about these groups, although for this dataset you can get almost exactly these 3 clusters using the k-means algorithm provided by VisuMap. The bar diagram in the picture shows the presence frequency of the 1025 fragments among the 171 selected compounds. That is, a higher bar indicates that a particular fragment is present in more compounds.
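
The presence frequency shown in the bar diagram is simply a per-fragment count over all compounds. The following is a minimal sketch in plain Python, not VisuMap's implementation; the function name `fragment_frequencies` and the toy data are made up for illustration, and the fingerprints are assumed to be equal-length lists of 0/1 flags.

```python
def fragment_frequencies(fingerprints):
    """For each fragment position, count how many compounds have that bit set."""
    if not fingerprints:
        return []
    counts = [0] * len(fingerprints[0])
    for fp in fingerprints:
        for i, bit in enumerate(fp):
            counts[i] += bit
    return counts


# Toy data (hypothetical): 4 compounds, 5 fragments each.
toy_fingerprints = [
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
]

freqs = fragment_frequencies(toy_fingerprints)
print(freqs)  # one count per fragment; a larger count means a taller bar
```

With the real dataset the input would be 171 rows of 1025 flags each, and the result would be the 1025 bar heights.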

The above map visualizes the similarity information between the 171 compounds. That is, closely located compounds have similar fragment collections and therefore presumably similar pharmacological properties. The similarity information is essentially encoded in the fingerprints. Thus, the method used to calculate those fingerprints is naturally critical for this kind of data analysis.

To better understand those fingerprints, we can create a map of the 1025 features with VisuMap. To do so, we simply transpose the binary data table (via the menu Edit>Filter Data>Transpose Table in VisuMap), so that each binary feature becomes a new data point and each compound becomes a feature in the transposed dataset. We can then pick a mapping algorithm and a metric to create a feature map. The following picture shows such a map created with the t-SNE algorithm and the Tanimoto dissimilarity metric:
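
As a rough illustration of the two ingredients named here, the sketch below transposes a small binary table and computes the Tanimoto dissimilarity between two of the resulting feature vectors, using the usual definition: 1 minus (number of positions set in both) over (number of positions set in either). It is plain Python written for this post, not VisuMap code. The function names and toy data are hypothetical, and treating two all-zero vectors as having dissimilarity 0 is an assumption, since the ratio is undefined in that case.

```python
def transpose(table):
    """Swap rows and columns: each feature becomes a data point."""
    return [list(column) for column in zip(*table)]


def tanimoto_dissimilarity(a, b):
    """1 - |a AND b| / |a OR b| for two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    return 1.0 - both / either


# Toy data (hypothetical): 3 compounds (rows) x 4 fragments (columns).
compounds = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]

features = transpose(compounds)  # now 4 rows, one per fragment
d01 = tanimoto_dissimilarity(features[0], features[1])
d12 = tanimoto_dissimilarity(features[1], features[2])
print(features)
print(d01, d12)
```

A full feature map would need the pairwise dissimilarities of all 1025 feature vectors as input to the mapping algorithm; that step is left to the mapping tool.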

Feature map of 1025 binary features

The above picture shows 4 or 5 clear clusters on the left side, represented by colored glyphs. The rest are more or less randomly distributed. It turns out that the yellow-square features are the fragments which are NOT present in any of the 171 compounds (all bits are zero). Therefore, they carry no direct information about our compounds. Interestingly, these zero vectors together form a homogeneous cluster in the map.
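
The all-zero features can also be found directly from the data table, without a map. Here is a minimal sketch in plain Python; `absent_fragments` and the toy data are hypothetical names and values chosen for illustration.

```python
def absent_fragments(fingerprints):
    """Return the indices of fragments whose bit is zero in every compound."""
    n_bits = len(fingerprints[0]) if fingerprints else 0
    return [i for i in range(n_bits)
            if not any(fp[i] for fp in fingerprints)]


# Toy data (hypothetical): 3 compounds, 5 fragments each.
toy_fingerprints = [
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
]

zero_features = absent_fragments(toy_fingerprints)
print(zero_features)  # indices of fragments present in no compound
```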

Other clusters in the above map represent groups of fragments which have high frequency and are informative for distinguishing the three compound groups in the original dataset. We can verify this with the help of the bar diagram in VisuMap as follows:

We first open the feature map in a separate window (via the menu Tools>Map Snapshot). Then we open the compound map, select all compounds, and open the bar view through the context menu "Bar View". By default, the bar view displays the frequencies in the order given in the transposed data table. We then sort the bars through the context menu "Sort Values", so that they are displayed in order from low frequency to high frequency, as depicted in the following two pictures:

The sorted bar view shows 3 plateaus which correspond to clusters in the feature map and in the compound map. To see the correspondence, we select a plateau in the bar view with the mouse; the snapshot window of the feature map will then automatically highlight the selected features. As we can see in the following picture, the selected plateau of features clearly corresponds to a particular cluster in the feature map (the marked cluster in the lower left corner).

Correspondence between high frequency features and feature clusters

We notice that some of the clusters in the feature map show some fine sequential structure that may lead to more hints about the internal structure of the fragment collections.

With the knowledge about informative feature clusters we can, for instance, reduce the number of features considerably without significant loss of information about the clusters in the original dataset. The VisuMap dataset folder PharMine171.xvmz (zipped XML file) includes a reduced dataset with 298 features that carries much the same similarity information as the original dataset with 1025 features.
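
One simple way to shrink the feature set is to drop every fragment that is constant across all compounds (all zeros or all ones), since such a fragment cannot help to tell compounds apart. The sketch below illustrates only this idea; it is not necessarily how the 298-feature dataset mentioned above was produced, and the function name and toy data are hypothetical.

```python
def drop_constant_features(fingerprints):
    """Keep only the fragments that are set in some, but not all, compounds.

    Returns the kept column indices and the reduced fingerprint table.
    """
    n_compounds = len(fingerprints)
    n_bits = len(fingerprints[0]) if fingerprints else 0
    keep = []
    for i in range(n_bits):
        count = sum(fp[i] for fp in fingerprints)
        if 0 < count < n_compounds:
            keep.append(i)
    reduced = [[fp[i] for i in keep] for fp in fingerprints]
    return keep, reduced


# Toy data (hypothetical): 3 compounds, 4 fragments each.
toy_fingerprints = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]

kept_columns, reduced = drop_constant_features(toy_fingerprints)
print(kept_columns)
print(reduced)
```

A stricter filter could also use the frequency plateaus described earlier, keeping only fragments from the informative clusters.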

The above VisuMap dataset folder also contains feature maps created with other mapping algorithms. It is interesting to note that, for the feature map dataset, the t-SNE algorithm provides the best visualization, whereas the result of the PCA algorithm is rather disappointing.
