API Usage

This manual serves as API Developer and User Guide of JChem PostgreSQL Cartridge (JPC). See also Getting started guide for easy setup and use cases.

If you are familiar with JChem Oracle Cartridge (JOC) and are interested in the differences between JOC and JPC, see their comparison here.

CREATE TABLE

The type of the column where the chemical structures are stored has to be any of the MOLECULE types defined in /etc/chemaxon/types/.

Prerequisites

  • jchem-psql service must run

  • the extension named chemaxon_type must be created.

CREATE TABLE table_name (structure_column_name MOLECULE('molecule_type_name'));

molecule_type_name must be specified in order to make the column searchable!

Example:

CREATE TABLE ttest (mol MOLECULE('sample'));

The created column can be propagated with any of the well-known chemical structural data file formats. Due to postgres type system, the provided string value is automatically converted to MOLECULE type. For importing different molecule formats from local files, see Auxiliary functions for importing into a table.

If you will run updates against a table which contains indexed molecule columns,
to reduce postgresql MVCC index entry creation/deletion traffic, postgres should be able to use HOT.
In order to facilitate the usage of HOT on your tables, use the 'fillfactor' storage argument. Read about fillfactor.

Example:
CREATE TABLE t (m molecule('sample')) WITH (fillfactor=50);

The next chart illustrates how the fillfactor influences the consumed time of the calculation of the molecular weight of 1 M molecules in columns indexed with the domain index - chemindex - applied in JChem PostgreSQL Cartridge.

Time of the MolWeight calculation in 1M datasets in chemindexed tables created with different fillfactor values

images/download/attachments/51023395/fillfactorEffect.jpg

The Reference is the same 1 M dataset, without chemindex, in a table created without fillfactor.

It can be seen that in indexed tables, the speed of the calculation executed in the non-indexed reference table can only be achieved if the fillfactor is set to 50. The 50 % value of the fillfactor means that the memory requirement of the table is approximately doubled compared to the 100 % (or no fillfactor).

Fillfactor 50 is recommended if calculated columns are planned to be added to chemindexed tables.

If no more calculations are planned, the fillfactor can be reset even to 100 by:

ALTER TABLE table_name SET ( fillfactor = 100);
VACUUM FULL table_name;

The possibility of dropping the chemindex before such calculations, and creating them after the calculation is always available.

MOLECULE INDEX

Indextype named chemindex has to be used when indexing a column containing chemical structures.
You can check whether this indextype exists or not:

SELECT * FROM pg_am WHERE amname='chemindex';

Indextype chemindex can be used in the following way:

CREATE INDEX index_name ON table_name USING chemindex(structure_column_name);

Example:

CREATE INDEX ttest_idx ON ttest USING chemindex(mol);

Using the index in any type of searches can be forced by setting enable_seqscan parameter to OFF value in postgres configuration file /etc/postgresql/9.5/main/postgresql.conf).

We suggest closing all transactions and running a VACUUM before creating an index to avoid the calculation of the index data on rows that are being changed in the current transaction. E.g. if a column was added in the current transaction, the index creation time can be almost double of the normal time.
The same logic applies to updating a large number of molecules in an indexed table and then search on the table without closing the transaction and clearing up modification with VACUUM. If possible update operations and column additions are advised to be performed before adding the chemical index.

MOLECULE IMPORT

All well known chemical structural data file formats are supported, but we provide functions for importing mol, sdf files.

In case of big files, the postgresql client can run out of memory as the following error message displays:

ERROR: out of memory
DETAIL: Cannot enlarge string buffer containing 0 bytes by 1506367917 more bytes

Please, split up the original input file into smaller pieces and import them one after another. (Here is a script tool provided for splitting SD files.)

For importing purposes, usage of scripts consisting of multiple SQL insert statements is not recommended as this operation may be time consuming (especially if the destination table contains a chemical index). To speed up import, it is advisable to use the standard ‘copy’ SQL function or the parse_sdf ChemAxon utility.

Import from SD file

Import from local SD/mol files containing added fields besides the chemical structure data

The content of all the added fields in the SD file will be stored in a hashmap like type column.Optionally selected added fields from the SD file can be stored in separate columns as well.

Follow the next steps:

  1. Set the content of the SD file to a variable (e. g.: content)

    \set variable_name `cat sdf_file_name`

    Example:

    \set content `cat ~/a.sdf`

    You can check the table generated by the parse_sdf function

    SELECT * FROM parse_sdf(:'content');

    The parse_sdf function returns the content of the sdf file as a table, which has two columns:

    molSrc = the full sdf source of the original entry in the sdf file

    props = hstore type column, which contains the properties of the sdf entry

    You can also create a table similar to the table created by the parse_sdf function by:

    CREATE TABLE tableName (mol MOLECULE('sample'), p hstore);
  2. Create a table with create table as:

    CREATE TABLE table_name AS
    SELECT molSrc::molecule('molecule_type_name') AS structure_column_name[, props ->'sdf_field_name' AS column_name]
    FROM parse_sdf(:'');

    Example:

    CREATE TABLE mysdftable AS
    SELECT molSrc::molecule('sample') AS mol,
    props -> 'MOLFORMULA' AS formula, CAST(props -> 'CDID' AS INTEGER) AS id
    FROM parse_sdf(:'content');

    where

    mysdftable = the name of the table

    mol = the name of the column for storing structural data

    molecule('sample') = the type of column mol

    sample = molecule type (must be a defined type in /etc/chemaxon/types/ folder)

    MOLFORMULA = the name of the field in the SD file for storing chemical formula

    formula = the name of the column for storing chemical formula

    CDID = the name of the field in the SD file storing the identifier

    id = the name of the column for storing the identifier integer

    You can also store all properties from the SD file entry in a separate column in the structure table, and later, optionally, add additional columns to separately store selected properties.

    Example:

    \set content `cat ~/a.sdf`
     
    CREATE TABLE mytable AS
    SELECT molSrc::molecule('sample') AS mol,
    props AS properties
    FROM parse_sdf(:'content');
     
    ALTER TABLE mytable ADD COLUMN formula TEXT;
     
    UPDATE mytable SET formula = properties -> 'MOLFORMULA'

Collect invalid molecules

Available from version 1.8.

Running the steps below, the molecules in SD/mol files will be checked during the import and the invalid molecules will be stored in a separate table named <new_table_name>_error.

Steps:

  1. Run the import_sdf.sql script to create function import_sdf. This script file is a customizable tool; you can update it according to your needs.

    \i import_sdf.sql;

    Usage of function import_sdf is not recommended in case of tables already containing indexed chemical data as this operation may be time consuming. To speed up import, in this case a temporary table without indexes can be created with import_sdf and then the content of this temporary table can be inserted into the destination table using the following SQL statement:

    insert into <destination table name> select (tt.mol, tt.props, …) from <temporary table name> tt;
  2. Run the following commands including import_sdf function with the appropriate parameters:

    \set <variable_name> `cat <path_to_your_sdf_file>`;
    SELECT import_sdf(:'<variable_name>', '<new_table_name>', '<molecule_type>');

    Example:

    \set content `cat ~/a.sdf`;
    SELECT import_sdf(:'content', 'molecules', 'sample');

Import from (cx)smiles or (cx)smarts files

Import from (cx)smiles or (cx)smarts files by the standard copy sql function

COPY table_name (structure_column_name) FROM 'file_name' (FORMAT csv);

Example:

COPY ttest(mol) FROM '/home/posgresuser/targetfiles/nci-pubchem_1m_unique.smiles' (FORMAT csv);


Import from(cx)smiles or (cx)smarts files while collecting the invalid molecules

Available from version 1.8

Running the steps below, the molecules in smiles/cxsmiles/smarts/cxsmarts files will be checked during the import and the invalid molecules will be stored in a separate table named <new_table_name>_error . We propose the following two methods: PL/pgSQL script method and SQL commands method.

PL/pgSQL script method

We suggest applying this script method when the server and the client are on the same machine; or at least the molecule files are available on the server and their absolute path is known. Otherwise, apply the steps of SQL commands method.

Steps:

  1. Run the import_single_line_format.sql script to create function import_single_line_format. This script file is a customizable tool; you can update it according to your needs.

    \i import_single_line_format.sql;
  2. Run the following SELECT using import_single_line_format function with the appropriate parameters:

    SELECT import_single_line_format('<absolute_path_to_my_file_on_server>', '<new_table_name>', '<molecule_type>');

    Example:

    SELECT import_single_line_format('/home/myuser/molecules.smiles', 'molecules', 'sample');

SQL commands method

  1. Create a table which will contain the valid molecules

    CREATE TABLE my_table(mol TEXT);
  2. Import from file

    \COPY my_table FROM '~/molecules.smiles' (FORMAT csv);
  3. Create and load a table for the invalid molecule sources

    CREATE TABLE my_table_error AS SELECT mol FROM my_table WHERE NOT is_valid_molecule(mol);
  4. Remove invalid molecules from my_table

    DELETE FROM my_table WHERE mol IN (SELECT mol FROM my_table_error);
  5. Convert the valid molecule sources into molecule

    ALTER TABLE my_table ALTER COLUMN mol TYPE molecule('sample') USING mol::molecule('sample');

Convert molecule source text to molecule from an existing table

Follow the steps (starting from step 3) of the SQL commands method above.

MOLECULE FUNCTIONS

Casting a string to any Molecule type defined in /etc/chemaxon/types/

'CCCC'::Molecule('sample')

Postgres type system can cast a string to molecule if it is obvious; if the operation needs a molecule as an input. There are cases when there is no need of explicit casting because of the auto cast mechanism. See details here.

Example:

SELECT 'C' |<| 'CC'::molecule('sample');

Interpreting a string using a specific format

As molecule strings are ambiguous in some cases, it is possible to interpret the given molecule string according to the given molecule format using the molecule(String, String) function:

molecule(structure_string, molecule_format)

where
structure_string = molecule string representation
molecule_format = molecule format string, see file formats

SELECT * FROM table_name WHERE molecule(structure_string, molecule_format) |<| column_name;

A typical usecase is the differentiation between SMILES and SMARTS strings:

  • 'CC' with SMILES notation is interpreted as two aromatic or aliphatic carbon atoms connected with a single bond.

  • 'CC' with SMARTS notation is interpreted as two aliphatic carbon atoms connected with a single bond.

Example:

SELECT * FROM ttest WHERE molecule('CC', 'smiles') |<| mol;
SELECT * FROM ttest WHERE molecule('CC', 'smarts') |<| mol;

Transformations

Transformation function is provided to transform the molecule structure according to specific needs. Multiple transformation strings can be provided separated by two dots.

query_transform(molecule, 'transformation_string1..transformation_string2')

where

molecule = "molecule string" | Molecule object | table column
transformation string = "fullfragment" | "dbsmarkedonly"

Molecule transformation combined with substructure search:

SELECT * FROM ttest WHERE query_transform(molecule, 'transformation string') |<| mol;
  • Double bond stereo option: dbsmarkedonly

By default, CIS query matches only with CIS target and TRANS query matches only with TRANS target. If matching of CIS query with both CIS and TRANS targets or matching of TRANS query with both CIS and TRANS targets is aimed, then the query molecule has to be transformed.
The query_transform function with dbsmarkedonly option is provided to accomplish this transformation.

query_transform('query_structure', 'dbsmarkedonly')

Examples:

SELECT * FROM ttest WHERE query_transform('C\C=C\C', 'dbsmarkedonly') |<| mol;
SELECT * FROM ttest WHERE query_transform('C\C=C\C', 'fullfragment..dbsmarkedonly') |<| mol;
  • Full fragment (exact fragment) search option: fullfragment

Full fragment search can be executed by transforming the query structure in a way that it matches only a full fragment of the target structure and by executing a substructure search on this modified query structure. The transformation adds 's*' query search property to all atoms.

query_transform(query_structure, 'fullfragment')

Examples:

SELECT * FROM ttest WHERE query_transform('C1CCCCC1', 'fullfragment') |<| mol;

Fingerprint generation

Fingerprints can be generated using the following function:

fingerprint(chemical_structure,size)

or

fingerprint(structure_column_name,size)

where size is the number of bits in the fingerprint bit string; it must be divisible by 32, e.g.: 512.

Example:

SELECT fingerprint('C1CCCCC1',512);

Chemical terms

(Available since version 1.4.)

Calculation of chemical terms

Function chemterm makes possible to calculate chemical terms.

You may need to purchase separate licenses to apply function chemterm depending on the Chemical Terms used.

chemterm('chemical_term','structure')

where

structure = a molecule string

chemical_term = a chemical term function

The output of function chemterm is one of the following formats:

  • string

  • molecule string in format mrv

Examples:

SELECT chemterm('name','CCO');
SELECT chemterm('mass()','CCO');
SELECT molconvert(chemterm('canonicalTautomer()','CC(O)=C')::Molecule,'smiles');

Addition of chemical term columns

Chemical term columns can be added by using triggers.

See the following code block as an example:

CREATE TABLE test(structure MOLECULE('sample'), molweight NUMERIC);
 
CREATE OR REPLACE FUNCTION set_molweight()
RETURNS trigger AS
$BODY$
BEGIN
IF NEW.structure is null THEN
NEW.molweight:=NULL;
END IF;
NEW.molweight:=chemterm('mass()',NEW.structure)::real;
RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;
 
DROP TRIGGER IF EXISTS tr_molweight ON test;
CREATE TRIGGER tr_molweight BEFORE INSERT OR UPDATE ON test
FOR EACH ROW EXECUTE PROCEDURE set_molweight();
UPDATE test SET molweight=chemterm('mass()',structure)::real;

Molconvert

Conversion to molecule formats

The use of ChemAxon's MolConverter is supported with some limitations:

molconvert('structure', 'format')

where

structure = a Molecule in any of the following formats

format = mrv, mol, rgf, sdf, rdf, csmol, csrgf, cssdf, csrdf, cml, smiles, cxsmiles, abbrevgroup, sybyl, mol2, pdb, xyz, inchi, or name

Example:

SELECT molconvert('CC','sdf');

Conversion to base64 encoded binary formats (image)

Molecules can be converted to binary image formats (png, jpeg, msbmp, pov, svg, emf, tiff, eps) or other binary formats (pdf) in Base64 encoded form.

Example:

SELECT molconvert('CC','base64:png');

Molecule validation

Available from version 1.8.

The following function is provided to check the validity of the molecules:

is_valid_molecule(structure_text)

where

structure_text = a Molecule in any of the following formats:

mrv, mol, rgf, sdf, rdf, csmol, csrgf, cssdf, csrdf, cml, smiles, cxsmiles, abbrevgroup, sybyl, mol2, pdb, xyz, inchi, or name

Example:

SELECT is_valid_molecule('CCO');

You can filter out the invalid structures from a table using the following SELECT statement:

SELECT structure_text from mytable where is_valid_molecule(structure_text) = 'f';

MOLECULE OPERATORS

The main features of the different search types are available in JChem Query Guide.

The molecule operators work also in tables without chemical index, however, chemical index makes the search operations much faster.

Substructure search

Substructure search is performed using the symmetrical sub-/super-structure search operator: |<|.

SELECT * FROM table_name WHERE query_structure |<| structure_column_name;
SELECT * FROM table_name WHERE structure_column_name |>| query_structure;

where

query_mol = "molecule string" | Molecule object | table column
target_mol = "molecule string" | Molecule object | table column

Examples:

SELECT '[#6]-[#6]' |<| 'CC'::Molecule('sample');
 
SELECT * FROM ttest WHERE 'c1ccccc1' |<| mol;
 
SELECT * FROM ttest WHERE 'testmol
Mrv0541 01211514572D
 
3 2 0 0 0 0 999 V2000
1.2375 -0.7145 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9520 -1.1270 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6664 -0.7145 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
M END
' |<| mol;

See also molecule functions about defining the query structure format.

Superstructure search

Superstructure search is performed using the sub-/super-structure search operator: |<|.

SELECT * FROM table_name WHERE query_structure |>| structure_column_name;
SELECT * FROM table_name WHERE structure_column_name |<| query_structure;

Example:

SELECT * FROM ttest WHERE 'CCC' |>| mol;

Full fragment (exact fragment) search

Full fragment search can be executed by transforming the query structure in a way that it matches only a full fragment of the target structure and by executing a substructure search on this modified query structure.
See details in transformations for full fragment search.

SELECT * FROM table_name WHERE query_transform(query_structure, 'fullfragment') |<| structure_column_name;

Example:

SELECT * FROM ttest WHERE query_transform('CC', 'fullfragment') |<| mol;

Duplicate search

Duplicate search is performed using the |=| operator.

SELECT * FROM table_name WHERE query_structure |=| structure_column_name;
SELECT * FROM table_name WHERE structure_column_name |=| query_structure;

Example:

SELECT * FROM ttest WHERE 'CCC' |=| mol;

Tautomer search

You have to apply such a molecule type which has tautomer = GENERIC tautomer mode.

How is chemical matching of the query and the target executed in tautomer search? The generic tautomer - representing all theoretically possible tautomers - of the target is matched with the query structure itself. This method is applied in substructure search, full fragment search, duplicate, and superstructure search.

Limitations:

  • SMARTS atoms and SMARTS bonds in the query structures are not supported.

  • The use of bond lists in query structures may slow down the search.

Similarity search

There are some differences between the similarity scores of the new method and of the old methods. The new method works on the standardized chemical structures while the old methods do not use the standardized format, only the original chemical structures. None of them supports query features on the query molecules, but they handle their presence differently.

New method

Available from JPC version 2.5. This new method provides better performance than the old methods, and there is no need to add extra fingerprint column to the table.

You can select those molecules and their similarity value from a table whose similarity value relating the given query structure is greater (or smaller) than a given similarity threshold value.

For this purpose the sim_filter or the sim_order type can be applied as shown below in the examples. The use of the sim_filter type results unsorted output, while the use of the sim_order type results sorted output.

For retrieving the most (or less) similar target molecules sorted by their similarity values, the sim_order type must be applied and the LIMIT n condition can also be useful for better performance if only the most similar (or dissimilar) molecules are required.

SELECT field1, field2, structure_column_name |~| 'query_structure' FROM table_name
WHERE row ('query_structure', similarity_value)::sim_filter operator structure_column_name;
 
SELECT field1, field2, structure_column_name |~| 'query_structure' FROM table_name
WHERE row ('query_structure', similarity_value)::sim_order operator structure_column_name LIMIT n; 

where

similarity_value is the similarity threshold value, a number between 0 and 1

operator can be |<~| (meaning similarity value less than or equal) or |~>| (meaning similarity value greater than or equal)

Examples:

SELECT mol FROM moltable WHERE row ('CCC', 0.8)::sim_filter |<~| mol;
SELECT mol, mol |~| 'CCC' FROM moltable WHERE row ('CCC', 0.8)::sim_filter |<~| mol;
SELECT count(*) FROM moltable WHERE row ('CCC', 0.8)::sim_filter |<~| mol;
 
SELECT mol, mol |~| 'CCC' FROM moltable WHERE row ('CCC', 0.8)::sim_order |<~| mol LIMIT 20;
SELECT mol, mol |~| 'CCC' FROM moltable WHERE row ('CCC', 0.2)::sim_order |~>| mol LIMIT 20;

Limitations:

  • Query structures with query features (like list atoms, query atoms, query bonds, ... ) are not supported.

Old methods

For similarity searches with the old methods - which are supported before version 2.5 - you must have a column filled with fingerprints. These fingerprints can correspond to molecules in another table (old method 1) or to molecules in the same table (old method 2). We strongly advise using the other table method (method 1) because of the performance. The fingerprint data are usually small and PostgreSQL database engine can store them in memory.

Fingerprints can be generated as described above.

It is advised to name the fingerprint column as fp - in order to make similarity search easily runnable from applications based on JChem PostgreSLQ Cartridge API.

The following statements should be used for running similarity search using old method 1:

CREATE TABLE molecule_table_name(structure_column_name MOLECULE('molecule_type_name'), id INTEGER);
CREATE INDEX molecule_table_idx ON molecule_table_name(id);
CREATE TABLE fingerprint_table_name AS
SELECT fingerprint(structure_column_name,512) fp, id FROM molecule_table_name;
CREATE INDEX fingerprint_idx ON fingerprint_table_name(fp);
CREATE INDEX fingerprint_id_idx ON fingerprint_table_name(id);
 
SELECT * FROM fingerprint_table_name
WHERE tanimoto(fingerprint(query_structure,512),fp) operator similarity_value;
SELECT * FROM (SELECT ft.*, tanimoto(fingerprint(query_structure,512),fp) AS tanimoto FROM fingerprint_table_name AS ft)
WHERE tanimoto operator similarity_value ORDER by tanimoto DESC;

where

operator can be <, <=, =, >, >=

similarity_value is a number between 0 and 1

fp is the recommended name of the fingerprint column

Example:

CREATE TABLE moltable(mol MOLECULE('sample'), id INTEGER);
CREATE INDEX moltable_idx ON moltable(id);

Insert molecules into moltable and/or create a trigger, then continue with the followings:

CREATE TABLE fptable AS SELECT fingerprint(mol,512) fp, id FROM moltable;
CREATE INDEX fptable_idx ON fptable(fp);
CREATE INDEX fptable_id_idx ON fptable(id);
 
SELECT * FROM fptable WHERE tanimoto(fingerprint('CCC',512),fp) > 0.9;
--to get the results in descending order by similarity
  SELECT t.id,t.tanimoto, moltable.mol FROM (SELECT fptable.*, tanimoto(fingerprint('CCC',512),fp) AS tanimoto FROM fptable) AS t, moltable
WHERE t.id = moltable.id AND tanimoto > 0.9 ORDER BY tanimoto DESC;

The following statements should be used for running similarity search using old method 2:

CREATE TABLE table_name(structure_column_name MOLECULE('molecule_type_name'));
ALTER TABLE table_name ADD COLUMN fp BYTEA;
UPDATE table_name SET fp = fingerprint(structure_column_name,512);
SELECT * FROM table_name
WHERE tanimoto(fingerprint(query_structure,512),fp) operator similarity_value;

Example:

CREATE TABLE simtable(mol MOLECULE('sample'));
ALTER TABLE simtable ADD COLUMN fp BYTEA;
UPDATE simtable SET fp=fingerprint(mol,512);
SELECT * FROM simtable
WHERE tanimoto(fingerprint('CCC',512),fp) > 0.9;

Please note that fingerprint values present in fingerprint columns must be recalculated when molecules are updated.

Known Issue

Update of tables with more than about 1000 records might be very slow.

You can create a trigger to update the fingerprint column in the fingerprint table (fptable) when new records are inserted into the molecule table (moltable).

Example:

CREATE TABLE moltable(mol MOLECULE('sample'), id INTEGER);
CREATE TABLE fptable AS SELECT fingerprint(mol,512) fp, id FROM moltable;
CREATE OR REPLACE FUNCTION set_fingerprint()
RETURNS trigger AS
$BODY$
BEGIN
IF (TG_OP = 'DELETE') THEN
DELETE FROM fptable WHERE id = OLD.id;
RETURN OLD;
END IF;
IF (TG_OP = 'UPDATE') THEN
IF NEW.mol is null THEN
UPDATE fptable SET id = NEW.id, fp = null where id = OLD.id;
END IF;
UPDATE fptable SET id = NEW.id, fp = fingerprint(NEW.mol, 512) where id = OLD.id;
RETURN NEW;
END IF;
IF (TG_OP = 'INSERT') THEN
IF NEW.mol is null THEN
INSERT INTO fptable (fp, id) VALUES (null,NEW.id);
END IF;
INSERT INTO fptable (fp, id) VALUES (fingerprint(NEW.mol, 512),NEW.id);
RETURN NEW;
END IF;
END;
$BODY$
LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS tr_fingerprint ON moltable;
CREATE TRIGGER tr_fingerprint BEFORE INSERT OR UPDATE OR DELETE ON moltable
FOR EACH ROW EXECUTE PROCEDURE set_fingerprint();

Limitations:

  • Query structures with query features (like list atoms, query atoms, query bonds, ... ) are not supported.

Reaction search

Available from version 2.3.

The following search types are supported not only for molecules but for reactions as well.

Connection handling during searching

After retrieving the desired hits the SQL connection needs to be closed in order to close the search on the jchem-psql server side. You can achieve this by having autocommit switch on or by calling commit explicitly.

Combination of structure query AND structure/non-structure query

  • WHERE condition referring to more than one column containing chemical structure data is not supported.

  • WHERE condition referring to one structure column and one or more columns containing non-chemical structure data is supported.

Relevance sorting

Available from version 1.8.

As in standard SQL, the user can order his results using ORDER BY commands.

For ordering search results, JChem PostgreSQL function relevance(Molecule) is provided, which gives back a numeric type value based on the atom counts and further topological features of the molecule.

It is suggested that relevance values be stored in the table for further query. It is also suggested that an index is created on the relevance column and further queries return their results ordered by the relevance value. This facilitates the usage of LIMIT <n> conditions as the most relevant hits are at the beginning of the result set. If the relevance column is created upon table creation or import the addition of the chemical index should also occur after adding the relevance column .

ALTER TABLE <mytable> ADD COLUMN <relevance_column> INT;
UPDATE <mytable> SET <relevance_column> = relevance(mol)::int;
CREATE INDEX <relevance_index> on <mytable>(<relevance_column>);
 
SELECT mol FROM <mytable> WHERE 'query_structure' |<| mol ORDER BY <relevance_column> LIMIT <n>;

Example (assuming <mytable> has column "mol" of type Molecule):

ALTER TABLE test ADD COLUMN relev INT;
UPDATE TEST SET relev = relevance(mol)::int;
CREATE INDEX relev_idx on test(relev);
 
SELECT mol FROM test WHERE 'c1ccccc1' |<| mol ORDER BY relev LIMIT 100;

Other chemical features/measures may also be used for ordering, the chemterm function provides help for their definition.

You can create a trigger to update the relevance column when new records are inserted into the table.

Example:

CREATE TABLE test (mol Molecule('sample'));
ALTER TABLE test ADD COLUMN relev INT;
 
CREATE OR REPLACE FUNCTION set_relevance()
RETURNS trigger AS
$BODY$
BEGIN
IF NEW.mol is null THEN
NEW.relev:=NULL;
END IF;
NEW.relev:=relevance(NEW.mol)::int;
RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS tr_relevance ON test;
CREATE TRIGGER tr_relevance BEFORE INSERT OR UPDATE ON test
FOR EACH ROW EXECUTE PROCEDURE set_relevance();
UPDATE test SET relev=relevance(mol)::int;

AUXILIARY FUNCTIONS

Debugging

Raising the log level of the psql-client for debugging purposes:

SET client_min_messages to debug;

Performance log

You can log the performance of the current session as:

SELECT * FROM perf_out();

You can clear the log as (available since version 1.4):

SELECT perf_reset();

Performance tuning of combined queries

Available since version 1.4.

The performance of searching with combined queries - when chemical structure query condition is combined with a non-structure query condition - might be improved by executing the following calibration steps.

  1. Prerequisite: the column structure_column_name in table table_name used for the calibration must be indexed using indextype chemindex.

  2. Calibration of cost factors

    select calibrate_cost_factors('table_name', 'structure_column_name', 'query_structure');

    where

    query_structure is an optional parameter. If not specified, chlorobenzene (Clc1ccccc1) is used by default.

    The applied query_structure is an important key factor of the calibration. The best specified query_structure has almost the same (+/- 10%) estimated selectivity in the given table as the number of its search hits. We recommend the next explain statement to run in order to get the estimated selectivity. The number of rows given in the output of the next statement is the estimated selectivity.

    explain select * from table_name where 'query_structure' |<| structure_column_name;

    In order to get better performance, you may need to increase the default_statistics_target value for PostgreSQL analyzer. Its default value 100 can be increased to maximum 10000.

  3. Application of the calibrated cost factors
    Please follow the suggested solutions given in the output of the select calibrate_cost_factors statement.

    Example:

To set the values permanently, add the lines

chemaxon.cost_slope_factor = -5.14e-07

chemaxon.cost_intercept_factor = 0.03866

to the PostgreSQL configuration file (e.g.: /etc/postgresql/9.5/main/postgresql.conf)

To set values only for the current session, execute the following commands:

SET chemaxon.cost_slope_factor = -5.14e-07;
SET chemaxon.cost_intercept_factor = 0.03866;

Performance tuning of searches in big tables

Available since version 1.8

Searching in big table can be a very time consuming operation if the available memory cannot store all the necessary data of the whole table. The memory management operations could increase the search times.

Performance of searches in big tables can be tuned by applying the following JVM heap size and PostgreSQL settings, furthermore, the SELECT statements used for querying can be refined.

  1. JVM heap size settings
    In order to ensure that the biggest table fits into the memory, the approximate minimum heap size can be calculated as:

    heap size > <number of rows in the biggest table in millions> * 800 MB + 700 MB

    Example on a table of 5 million rows:

    heap size > 4700 MB

    Starting from version JChem PostgreSQL 2.3, the necessery heap size is increased:

    heap size > <number of rows in the biggest table in millions> * 870 MB + 700 MB

    Example on a table of 5 million rows:

    heap size > 5050 MB

    Heap size can be set by specifying the Xmx parameter in the JCHEM_PSQL_OPTS variable of the /etc/default/jchem-psql properties file.

  2. PostgreSQL settings
    If the user would like to execute queries

    1. on big tables - containing many entries and big row data. E.g., the molecule source is a verbose one (sdf, mrv, ...)

    2. where psql needs to fetch majority of the rows. E.g., sql query without any additional restricting condition or limit parameter

    then PostgreSQL needs to be able to fetch all rows rapidly requiring large memory because of the big table. In order to achieve this, set the shared_buffers parameter in postgresql.conf to be able to store your table.

    Table size can be checked by \dt+ <tablename> command from the psql client.

    Additionally, to enhance PostgreSQL performance, it may be advisable to increase linux's shared memory with the following command:

    sysctl -w kernel.shmall = <shared memory size in bytes>/<page size>

    The default value of page size is usually 4096. It can be checked by getconf PAGESIZE.To store the value of kernel.shmall permanently, add it it to the sysctl.conf file. For more details about required shared memory settings for PostgreSQL server please visit the PostgreSQL documentation.

    To reduce the needed buffer size you may consider the following possibilities.

    1. Change the input format to a concise one: smiles, smarts, ...
      (e.g., 8M rows of a PubChem dataset need ca. 17 GB when the molecules are in sdf format, but only ca. 1.4 GB when they are in smiles format).

    2. Avoid queries that require the fetching of all table data through adding additional "where" condition or limit parameter.

      For example, instead of

      SELECT <id_column_name> FROM <table_name> WHERE 'c1ccccc1' |<| mol;

      which may return several millions of hits, use the following statement applying LIMIT <n> and relevance sorting in order to obtain the most relevant n hits:

      SELECT <id_column_name> FROM <table_name> WHERE 'c1ccccc1' |<| mol ORDER BY relevance LIMIT 100;
    3. Increase the fillfactor to 90% instead of the above recommended 50% and perform vacuum regularly.

    4. As PostgreSQL is not optimized for the COUNT() method, it may take a long time if it returns large value. For obtaining an estimation of the count of hits we rather suggest using the EXPLAIN command.

    Please also check that the JVM heap size plus the PostgreSQL shared buffer size is not more than 2/3-rd of the total available memory to avoid slow-down of the operation system.

  3. Troubleshooting
    If searches on big tables are slow then probably the previously mentioned memory parameters need to be adjusted properly. However if searches on small tables are slow even after repeated execution, it may be a result of too large cache of the jchem-psql service.
    The cache size is estimated automatically based on heap size and based on average size molecules. It may cause too much heap consumption for big molecules. In such cases, it is suggested to define the two cached object count parameters manually in /etc/chemaxon/jchem-psql.conf. Set them to a value smaller than Xmx [in GB] * 1000000.