Diverse Set Selection

This manual describes the Diverse Set Selection clustering algorithm.


Chemical diversity is an important concept in several areas of scientific research. Calculating diversity is one of the most important considerations in pre-clinical drug discovery, in particular in the design of diverse chemical libraries for combinatorial chemistry and in compound selection for High Throughput Screening (HTS). ChemAxon’s diversity selection facilitates sampling of a given data matrix to obtain the most diverse compounds that span the entire descriptor space.

Algorithm of diversity sorting

1. From the initial dataset, the most dissimilar molecule pairs are selected based on a given descriptor set.
2. The diversity of the dataset of remaining compounds is calculated.
3. From the remaining compounds, the structure that is most dissimilar to the selected dataset is picked.
4. Steps 2-3 are repeated until all compounds are exhausted or the user aborts the diversity sorting.

Diversity selection is a time-consuming procedure: the time required for the calculations increases with the square of the size of the dataset. The method works best on molecule sets that are not too diverse; on highly diverse sets it produces many outliers.
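The steps above can be illustrated with a minimal Python sketch. It is not ChemAxon's implementation. It assumes that fingerprints are given as sets of on-bit indices, that dissimilarity is 1 minus the Tanimoto coefficient, that a single most dissimilar pair seeds the selection, and that "most dissimilar to the selected dataset" means the largest distance to the nearest already selected compound (the maximal-minimal criterion). The names tanimoto_dissimilarity and maxmin_order are hypothetical.

```python
from itertools import combinations


def tanimoto_dissimilarity(a, b):
    """Return 1 - Tanimoto similarity of two fingerprints (sets of on-bits)."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def maxmin_order(fingerprints, dissimilarity=tanimoto_dissimilarity, limit=None):
    """Return compound indices in diversity order (illustrative sketch only).

    `limit` stops the sorting early, mimicking an abort by the user.
    """
    n = len(fingerprints)
    if limit is None:
        limit = n
    if n < 2:
        return list(range(n))[:limit]

    # Step 1: seed with the most dissimilar pair. This scans all pairs,
    # which is where the quadratic cost comes from.
    first, second = max(
        combinations(range(n), 2),
        key=lambda p: dissimilarity(fingerprints[p[0]], fingerprints[p[1]]),
    )
    selected = [first, second]

    # Step 2: for each remaining compound, store the dissimilarity to its
    # nearest selected compound.
    nearest = {
        k: min(
            dissimilarity(fingerprints[k], fingerprints[first]),
            dissimilarity(fingerprints[k], fingerprints[second]),
        )
        for k in range(n)
        if k not in (first, second)
    }

    # Steps 3-4: repeatedly pick the compound farthest from the selected set,
    # then refresh the nearest-selected distances of the rest.
    while nearest and len(selected) < limit:
        pick = max(nearest, key=nearest.get)
        del nearest[pick]
        selected.append(pick)
        for k in nearest:
            d = dissimilarity(fingerprints[k], fingerprints[pick])
            if d < nearest[k]:
                nearest[k] = d

    return selected[:limit]


# Toy fingerprints, invented for illustration only.
example = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8, 10}]
print(maxmin_order(example))
```

Taking the first k indices of the returned order gives a diverse subset of size k under these assumptions. Near-duplicates of already selected compounds end up at the tail of the ordering.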


You can use the Diverse Set Selection algorithm via the jklustor command line tool:

jklustor [<options>] [<input files>]

Before first use, set up the jklustor script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.


mmds[:<count>] Use Maximal-Minimal Dissimilarity Selection based clustering. <count>: targeted cluster count
-h, --help help message
-c, --config <filepath> path of the XML configuration file
-o, --output <filepath> output file path (default: stdout)
-t, --tag name of the SDFile tag to store the Pharmacophore Map (default: PMAP)
-S, --sdf-output SDF output (otherwise only PMAP list)
-g, --ignore-error continue with next molecule on error
-v, --verbose print calculation warnings to the console


Diverse subset selection resulting in 5 clusters can be run as:

jklustor -c mmds:5 input.sdf