Comparing libraries with Compr
This manual page describes how to compare libraries with the Compr tool:
Introduction
Compr compares two sets of objects (like compound libraries) using diversity and dissimilarity calculations.
-
Comparing all individual compounds of a set with a library.
-
Comparing two libraries.
-
Library self-dissimilarity test: comparing all individual compounds of a set with the rest of the compounds.
This document mentions molecules as the entities to be compared, but the software can also be used for other types of objects.
The algorithm applies nearest neighbor searching that finds molecules similar to the query object. The calculation applies the Tanimoto (or Jaccard) coefficient that is calculated by the following formula in the case of binary fingerprint (bit string) input:
T(A,B) = NA&B/(NA+NB-NA&B)
where NA and NB are the number of bits set in the bit strings of molecules A and B, respectively, NA&B is the number of bits that are set in both.
When only binary fingerprints are used for the calculation of the dissimilarity between molecules, then the formula of the dissimilarity of molecule A and B is
D(A,B) = 1-T(A,B)
where T(A,B) is the Tanimoto coefficient for molecule A and B.
When other columns are also used, a weighted Euclidean distance calculation is applied:
D(A,B) = sqrt{[1-T(A,B)] + w1[C1(A)-C1(B)]2 + w2[C2(A)-C2(B)]2 + ...}
where
-
w1, w2, ... are weights
-
T(A,B) is the Tanimoto coefficient for molecule A and B
-
Ci(A) is the value of descriptor i of molecule A.
-
sqrt is the square root function.
Instead of the brute force method, Compr applies heuristics to avoid calculating all pairwise dissimilarity calculations and neighbor list comparisons.
Usage
You can use Compr by using the following command:
compr <options>
Prepare the usage of the compr script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts, or call the Compare class directly. This can be done in the following ways.
-
Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" chemaxon.clustering.Compare <options>
-
Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
chemaxon.clustering.Compare <options>
Because the utility has many parameters, it may be reasonable to create a shell script or a batch file for calling the software.
Options
General options:
-h --help this help message
-d --driver <JDBC driver> JDBC driver
-u --dburl <url> URL of database
-l --login <login> login name
-p --password <password> password
-P --proptable <tablename> name of property table
-s --saveconf save settings into ~/.jchem
Input options (default: standard input):
-i --input <path1> <path2> input files to compare (text file input)
-q --query <sql1> <sql2> SQL query strings for reading input
(database input)
Combined input (table vs. file comparison):
-i <path> -q <sql>
Output options (default: standard output):
-o --output <filepath> output file path (text file output)
-a --statement <sql> SQL statement for inserting results
(database output)
-z --stat print statistics on overall dissimilarity
-Z --only-stat statistics only on overall dissimilarity
-y --only-dissimilar print only dissimilar objects
-L --list-similar list similar objects from the first set
-v --verbose verbose output
Data properties
-m --dimensions <dim> number of floating-point descriptors
-f --fingerprint-size <bits> binary fingerprint size in bits.
fpsize should be a multiple of 32
-w --weights <w1> <w2> ... the weights of the floating-point descriptors
-g --generate-id generate id for each compound.
Conditions of calculation
-t --threshold <threshold> maximum dissimilarity of two compounds
compounds (default: 1)
-D --different-ids don't compare compounds with the same id
-c --max-count <count> maximum number of similar objects listed
-r --order sort similar objects by distance (closest first)
Without a valid license key, the software is in demo mode and maximum 1000 structures can be retrieved from the database.
Loading/Saving of Settings
It would be inconvenient to enter all of the parameters of the compr script at each run. To overcome this problem, it is possible to save some of the settings that are not changed frequently in the .jchem file stored in the user's home directory. Use the --saveconf option to store the following settings:
-
JDBC driver's class name (--driver)
-
JDBC URL of database (--dburl)
-
Login name (--login)
-
Password (--password)
-
Fingerprint size (--fingerprint-size)
The settings needed for the database connection are also modified and saved by JChemManager. If you successfully entered into the database using JChemManager, then you don't need to set connection for Compr manually.
Database Connections
For more information on setting connection parameters:
-
JDBC driver's class name (--driver)
-
JDBC URL of database (--dburl)
-
Login name (--login)
-
Password (--password)
-
Property table (--proptable)
please visit the Administration Guide of JChem.
Input
Two data sources are retrieved, which contain the data of molecule libraries to be compared. The software may import data from either text files (--input) or database tables (--query). Each input data source must contain the following columns:
Columns |
Type |
Content |
Id |
Integer numbers |
Id of compounds |
fp1, fp2, fp3 ... |
Integer numbers |
Binary fingerprints in integer number blocks |
d1, d2, d3, ... |
Floating point numbers |
Other descriptors |
Comments:
-
Pharmacophore fingerprints can be generated using the GenerateMD tool. These fingerprints are not binary, so they have to be specified as other descriptors. See an example for combining GenerateMD and Jarp using pharmacophore fingerprints.
-
At least one binary fingerprint column or descriptor column is required.
-
Use the --generate-id option if the id column is missing from the input data.
-
Text input files can be created using the GenerateMD application. For example
generatemd c structures.smi -k CF -c cfp.xml -D -o fingerprints.txt
A sample XML configuration file (cfp.xml) can be found in the config directory under the examples directory.
-
In the case of text input, the delimiter between two numbers should be space or tab (comma is not allowed).
-
The cd_id and cd_fpi columns in JChem's structure tables are appropriate as input.
-
In the case of database input, an SQL select statement is needed to retrieve the columns. For example
compr -q "SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6 FROM structures" ...
(For the sake of readability only 6 fp. columns are applied in the above example, but usually this number is much higher.)
-
It is important to place the query statement between quotes because it contains spaces.
Output
The software can write the results of the calculation into either a text file (--output) or a database table (--statement).
Each row of the output belong to a compound in the second set. The following symbols are used in the description of columns:
L1, L2 |
The compound libraries specified in this order in the call of Compr |
C |
The compound from L2, which belongs to the specific row |
D(C,Ai) |
The dissimilarity between C and compound i from L1 |
The exported data contains a subset of the following columns:
Columns |
Type |
Content |
Id |
Integer numbers |
The identifier of C |
minD |
Floating point numbers |
The dissimilarity of C and its nearest neighbor from L1: min(D(C,Ai)) |
nneib |
Integer numbers |
The nearest neighbor of C in L1 |
simcnt |
Integer numbers |
The number of objects in L1, which are similar to C. (Their dissimilarity is lower than or equal to the specified threshold value.) |
avgD |
Floating point numbers |
The average dissimilarity of C and all compounds from L1: sum(D(C,Ai))/N(L1) |
maxD |
Floating point numbers |
The maximum dissimilarity between C and all compounds from L1: max(D(C,Ai)) |
list_of_similar_objects |
Integer numbers in several columns |
The list of molecules with a dissimilarity below the threshold (neighbors) in one row. |
Only the column "Id" is printed if the --different-ids option is specified. "avgD" and "maxD" columns are written only if either the --statistics or the --only-statistics option is given. The "list_of_similar_objects" columns are printed if the --list-similar option is specified.
Comments for database output:
-
A precondition of database output is the existence of a database table that contains the above columns. Create the database table before starting the calculation.
Examples for table creation:-
If the result will contain the id values of dissimilar objects (--different-ids option is specified)
CREATE TABLE compr_result (
cd_id INTEGER NOT NULL PRIMARY KEY) -
If the result will contain objects of the second library, which are similar to objects in the first library (none of --different-ids, --statistics, and --only-statistics is specified)
CREATE TABLE compr_result (
cd_id INTEGER NOT NULL PRIMARY KEY,
minD FLOAT,
nneib INTEGER,
simcnt INTEGER) -
If the result will contain all objects with all details (--statistics is specified)
CREATE TABLE compr_result (
cd_id INTEGER NOT NULL PRIMARY KEY,
minD FLOAT,
nneib INTEGER,
simcnt INTEGER,
avgD FLOAT,
maxD FLOAT)
-
-
Before starting the calculation, make sure that the table is empty. The SQL DELETE statement may be applied for deleting the rows in a database table.
Example for deleting all rows:DELETE FROM compr_results;
-
In the case of database output, an SQL statement is needed to be specified for Compr (-a option), which inserts the rows containing the results.
Example:compr -a "INSERT INTO compr_results(cd_id, minD, nneib, simcnt) VALUES(?,?,?,?)" ...
The "?" symbols will be substituted with the corresponding values.
-
If the table is filled with the results, the rows may be retrieved using SQL SELECT statements.
Example:SELECT * FROM compr_results WHERE minD > 0.05
-
It is important to place the import statement between quotes because it contains spaces.
-
In the case of database output, don't use the --list-similar option, because in then the number of columns is not fix.
Diversity Statistics
Optionally, Compr can print diversity statistics into the standard output or the given output file. The parameters that enable statistics printing are --statistics or --only-statistics. (The latter one doesn't allow to print information on individual compounds.) The following data will be printed:
-
Number of objects in set 1
-
Number of objects in set 2
-
Diversity measures:
-
minimum dissimilarity between sets
-
average dissimilarity between sets
-
maximum dissimilarity between sets
-
The calculation is significantly slower if statistics is enabled, since all pairwise dissimilarity values have to be calculated. (Heuristics cannot be applied.)
Comments on Some Parameters
--fingerprint-size
The number of binary fingerprint columns multiplied by 32 (because the bit-length of integer numbers is 32 in Java)
--dimensions
Specifies the number of other columns. If only binary fingerprints are used in the clustering process, then this parameter doesn't have to be set.
--weights
When other columns are used, a weighted Euclidean distance calculation may be applied. If there are also binary fingerprint columns, weights are relative to the Tanimoto coefficient calculated from the binary fingerprints (the Tanimoto coefficient has a weight of 1.0).
--threshold
Compounds with a dissimilarity below the threshold will be considered similar.
--different-ids
Compounds with the same id will not be compared. This is useful in any of the following cases:
-
The two compound sets are not disjunct (some of the compounds are the same).
-
If self dissimilarity is tested, that is, when the two compound sets are the same.
The precondition to use this option is that the id values of the same compounds are the same.
By default, the heap size in some Java runtime environments is limited to 64MB, so you may run out of memory easily. See the FAQ on increasing the heap size.
Examples
In the examples it is supposed that, when needed, all connection parameters are set and stored by JChemManager (or a previous saving by Compr).
-
A batch file (Windows) for reading from a database and writing the id of all compounds in the commercial_library table that are similar to the structures in the home_library table to the standard output:
set QUERY1="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM home_library"
set QUERY2="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM commercial_library"
compr -q %QUERY1% %QUERY2% -t 0.1 -f 512 -D
-
A UNIX shell script for reading binary fingerprints from two database tables (home_library and commercial_library) and insert the results into another table. It collects id-s and calculates dissimilarity information on structures in the commercial_library, which are similar to some structures in the home_library:
QUERY1="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM home_library"
QUERY2="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM commercial_library"
INSERT="INSERT INTO simil(cd_id, minD, nneib, simcnt) VALUES(?,?,?,?)"
compr -q "$QUERY1" "$QUERY2" -a "$INSERT" -t 0.1 -f 512
Make sure that the simil table exists and is empty before running the script.
-
Full statistics calculation using the output of GenerateMD (generated id values are needed):
generatemd c home_library.smi -k CF -c cfp.xml -D -o home_library.fp
generatemd c commercial_library.smi -k CF -c cfp.xml -D -o commercial_library.fp
compr -f 512 -t 0.1 -g -z -i home_library.fp commercial_library.fp > compr_result.tab
Example for the result file:
id minD nneib simcnt avgD maxD
1 0.165 40092 0 0.531 0.770
2 0.056 8998 1 0.546 0.766
....
999 0.095 365 1 0.521 0.804
1000 0.103 221 0 0.527 0.821
STATISTICS
Number of objects in set 1 = 90000
Number of objects in set 2 = 1000
Minimum dissimilarity between sets = 0.0057803392
Average dissimilarity between sets = 0.543819
Maximum dissimilarity between sets = 0.82962966 -
Self-dissimilarity test of the home library:
compr -f 512 -t 0.1 -g -z -i home_library.fp home_library.fp > compr_result.tab
-
Displaying the structures and the results of a full statistics calculation using the CreateView and MarvinView applications:
-
Creating an SDfile containing the calculated data (the minD, avgD, and simcnt columns) and the structures:
crview -i id -d "minD:avgD:simcnt" -s commercial_library.sdf -t compr_result.tab > compr_result.sdf
-
Displaying the structures and the following data: CAS_RN (comes from the original SDfile), minD, avgD, and simcnt:
mview -r 3 -f "CAS_RN:minD:avgD:simcnt" compr_result.sdf
A screenshot of the results shown by MarvinView is the following:
-