K-means clustering

This manual describes the k-means clustering algorithm:

Introduction

ChemAxon developed a derivative of k-means algorithm: a rapid, non-hierarchical method wherein the number of clusters is specified and the clusters are seeded with an initial structure. Additional structures are added to clusters based on similarity and then structures may be relocated to other clusters iteratively, again based on similarity. The final clusters depend on the ordering of structures as they are placed in clusters. Clusters will tend to be globular and it must be specified a priori how many clusters are desired.

The method works as follows: The user selects a number of clusters (K) and provides a set of K initial modals to act as starting seeds. An initial classification pass occurs: Each item in the input dataset is assigned to a cluster such that the distance from that item to the modal for that cluster is minimized. As each item is added the modal for the cluster is recomputed to include the new item. Zero or more relocation passes occur: Each item in the input dataset is checked against the neighbouring modals. If it is closer to another cluster modal, it is moved from the current cluster to the new cluster. The modals for both clusters are recomputed. Typically relocation passes occur until no items relocate over an entire pass. When complete, each of the K clusters is output with its member items and the final modal fingerprint for that cluster.

This method is much less resource intensive than Jarvis-Patrick; the order of the algorithm is O(K*N), which is typically much less than the O(N2) of Jarvis-Patrick. Therefore it is a good choice for larger datasets and in situations where one wants to select a specific number of sample items from a dataset (K). Also, it is a flexible method for simple partitioning and classification operations to select weighted subsets of a dataset rapidly. Note that, unlike Jarvis-Patrick, modified k-means does depend on the input order of the dataset.

Usage

You can invoke the k-means algorithm from the jklustor command line tool:

jklustor [<options>] [<input files>]

Prepare the usage of the jklustor script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.

Options

   kmeans[:cluster count]       K-means clustering
-h, --help help message
-c, specify the clustering method
-o, --output <filepath> output file path (default: stdout)
-t, --tag name of the SDFile tag to store the
Pharmacophore Map (default: PMAP)
-S, --sdf-output SDF output (otherwise only PMAP list)
-g, --ignore-error continue with next molecule on error
-v, --verbose print calculation warnings to the console

Examples

Invoke k-means clustering resulting in 10 clusters:

jklustor -c kmeans:10 testdata.sdf