JChem PostgreSQL Cartridge Manual
This manual serves as Administration Guide and API Developer Guide of JChem PostgreSQL Cartridge (JPC). Here you can find the description of software requirements, installation and configuration steps and running of JPC. The API Guide section provides information for the users and the developers. See also Getting started guide for easy setup and use cases.
If you are familiar with JChem Oracle Cartridge (JOC) and are interested in the differences between JOC and JPC, see their comparison here.
System requirement
Software
-
CentOS, RedHat or Debian operating system x86_64 version.
-
postgresql-94 relational database (preferred download page)
-
availability of hstore extension of postgresql-94 (e.g., contrib package)
-
Java8 runtime environment
Hardware
The memory need of JChem PostgreSQL Cartridge strongly depends on the format of the chemical structures to be stored. In the next table, the approximate memory need of 10 M PubChem molecules in SD file format and in SMILES format are compared. Further details can be found in the Performance tuning of searches in big tables section.
Indexed column containing 10 M PubChem molecules in |
Service jchem-psql [GB] |
Postgres shared buffer [GB] |
Total [GB] |
SD file format |
9 |
21 |
30 |
SMILES format |
9 |
2 |
11 |
Installation and Setup
Install PostgreSQL Cartridge
Download the latest version of PostgreSQL Cartridge from here. In order to install the latest version (as root), change the x.y (version information) in the following statement to the current one:
sudo yum localinstall jchem-psql/jchem-psql-x.y.x86_64.rpm
See also Getting started guide for PostgreSQL and Java8 installation.
Before the first use
-
Install your license file
Copy a valid ChemAxon license to /etc/chemaxon/license.cxl or set its location in /etc/chemaxon/jchem-psql.conf file. Required license in the license file is 'Postgres Cartridge'.
The jchem-psql user should have read access to the license file. -
Initialize index directory
sudo service jchem-psql init
If you encounter any problem in your java installation, set your JAVA_HOME in the file /etc/default/jchem-psql.
Initialization does not start the service, you must start it manually.
-
First start
sudo service jchem-psql manual-start
Create extension in a database
Each chemical database should have the following extensions created separately.
-
Create a new user and database
sudo su postgres
createuser testuser
createdb testdb -O testuser
-
Install the extensions (you must be postgres user or any other user who has create extension privilege)
psql testdb
testdb>
CREATE
EXTENSION chemaxon_type;
testdb>
CREATE
EXTENSION hstore;
testdb>
CREATE
EXTENSION chemaxon_framework;
-
Checking JChem-psql cartridge installation
If jchem-psql service is running, the following query can be executed without any problem:
testdb>
SELECT
'C'
::Molecule(
'sample'
) |<|
'CC'
::Molecule;
This select statement executes a substructure search, with a Carbon atom as the query and ethane as the target structure. You must receive true (t) as output. See details of Molecule type below.
Service
The service can be started/stopped using:
sudo service jchem-psql start
sudo service jchem-psql stop
ENABLED=1 (default) setting in /etc/default/jchem-psql file makes the service start on system boot and start manually by start.
If you want the service not to start on system boot, set ENABLED=0 in the /etc/default/jchem-psql file. In that case, manual-start must be run.
sudo service jchem-psql manual-start
Configure JChem-psql server
The following configuration parameters are available on PostgreSQL side:
Parameter name |
Value |
Description |
chemaxon.hit_retrieval_batch_size |
5000 |
Number of hits in a batch between jchem-psql server and posgresql database. Higher value generates higher initial latency, low value creates communication overhead. |
chemaxon.index_creation_batch_size |
5000 |
Number of records in a batch between jchem-psql server and posgresql database during index creation. Higher value generates higher memory footprint, but better throughput for traditional hard disk. SSDs can work very efficiently on low values. |
chemaxon.search_wall_time_limit |
600000 (10 minutes) |
The maximum wall time available for a search operation in milliseconds. |
They can be edited as an ordinary Postgres configuration parameter:
-
in the postgres.conf file (e.g., /etc/postgresql/9.4/main/postgresql.conf)
-
can be overwritten in the current session:
SET chemaxon.hit_retrieval_batch_size to
1000
;
-
to see the current value:
SHOW chemaxon.hit_retrieval_batch_size;
The following configuration parameters are available on JChem-psql server side:
Parameter name |
Description |
com.chemaxon.jchem.psql.idx.cachedObjectCount |
Number of fingerprints in cache. All of them are needed at every search, so it is advised always to keep them in memory for the frequently used tables. |
com.chemaxon.jchem.psql.main.cachedObjectCount |
Number of molecules in cache. Only the screened objects are searched for, so this cache can be left at a small size. Only the number of the objects can be set, so the total size of used memory depends on the current structures and can vary from time to time. |
com.chemaxon.jchem.psql.env.indexDir |
Directory storing the indexes needed by the cartridge. |
com.chemaxon.jchem.psql.env.userTypeDir |
Directory storing the molecule type files. |
com.chemaxon.jchem.psql.runtime.maxOpenSessionCount |
Limit on the number of open service sessions at same time (default is 100). Each PSQL command which uses jchem-psql service opens a new service session and closes it at the end. Service session requests above the set value of the open sessions are rejected with a message. Available from version 2.0. |
These parameters can be configured in the /etc/chemaxon/jchem-psql.conf file.
Molecule types
Molecule types define the interpretation mode of the chemical structures. Molecule types are based on the extension chemaxon_type. The definition of these types has to be stored in /etc/chemaxon/types/ folder as a <molecule_type_name>.type file. You can add, modify, delete molecule type files according to your needs.
We provide a sample type, see file /etc/chemaxon/types/sample.type.
The following settings can be defined in a <molecule_type_name>.type file:
-
Version of the type descriptor (at the moment, only '1' is accepted)
-
Type ID: positive unique identifier among types
-
Tautomer mode: OFF, GENERIC
-
Standardizer action string or Standardizer file containing standardization requirements
Standardizer action string must follow the syntax of command line standardizer actions:
Example:
aromatize:basic..addexplicitH..replaceatoms:queryatom=
'C'
:replaceatom=
'N'
Standardizer configuration file can be created as described in Creating a Configuration Standardizer page.
The types stored in /etc/chemaxon/types/ are loaded when the jchem-psql service is initialized. Therefore, if you add a new molecule type or change an existing one, the following steps have to be executed (using the present version of JChem Postgres Cartridge):
-
Create/modify/delete molecule type file(s) according to needs
-
Stop the service: sudo service jchem-psql stop
-
Delete the content of the index directory: sudo rm /var/lib/jchem-psql/v1/*
-
Initialize the service: sudo service jchem-psql init
-
Start the service: sudo service jchem-psql manual-start
-
Change to postgres user: sudo su postgres
-
Login to your database: psql
-
Reindex the database: reindex database <database name>
The query structure and the target structure(s) must always have the same molecule type. There is an auto-casting implemented; this means, that in the case of comparing two molecules, it is enough to define the molecule type of either of them. The following three statements have the same meaning:
SELECT 'C'::Molecule('sample') |<| 'CC'::Molecule('sample');
SELECT 'C'::Molecule('sample') |<| 'CC'::Molecule;
SELECT 'C'::Molecule |<| 'CC'::Molecule('sample');
Invalid select statements
-
missing molecule type definition
SELECT 'C'::Molecule |<| 'CC'::Molecule;
-
different molecule type definition
SELECT 'C'::Molecule('type1') |<| 'CC'::Molecule('type2');
In the case of database search, the molecule type of the structure column relates to the query structure as well.
Upgrade
Upgrade with keeping existing molecule type data
-
Stop the service
sudo service jchem-psql stop
-
Install the new version
-
Initialize the service (this step deletes the old indexes)
sudo service jchem-psql init
Only the index contents are deleted, the molecule records stored in the PostgreSQL database stay untouched.
-
Start the service
sudo service jchem-psql manual-start
-
Upgrade your existing postgresql database
-
Update the extensions
sudo su postgres
psql
# ALTER EXTENSION chemaxon_type UPDATE;
# ALTER EXTENSION chemaxon_framework UPDATE;
In some cases of the upgrade process you have to drop the indexes before altering the chemaxon_framework extension. If you receive an error message like
"ERROR: cannot drop operator class chemindex_int_ops for access method chemindex because other objects depend on it
DETAIL: index upg_ind depends on operator class chemindex_int_ops for access method chemindex
HINT: Use DROP ... CASCADE to drop the dependent objects too.",
you have to drop the index referred in the DETAIL part of the message and repeat step
# ALTER EXTENSION chemaxon_framework UPDATE;
-
Reindex the database
psql
# reindex database <database name>;
If you miss this step, all indexes of type chemindex will not function properly.
-
Upgrade without keeping existing molecule type data
-
Before installing the new version, drop the following extensions
DROP
EXTENSION chemaxon_framework
CASCADE
;
DROP
EXTENSION chemaxon_type
CASCADE
;
If you drop the chemaxon_type extension, it will implicitly drop all chemical columns.
-
Stop the service
sudo service jchem-psql stop
-
Delete the content of the index directory
sudo rm /var/lib/jchem-psql/v1/*
-
Install the new version
Uninstall
The present chemaxon_framework extension must be dropped:
DROP
EXTENSION chemaxon_framework;
DROP
EXTENSION chemaxon_type;
If you drop the chemaxon_type type extension, it will implicitly drop all chemical columns.
The installed jchem-psql-x.y.x86_64 package can be removed.
API Usage
CREATE TABLE
The type of the column where the chemical structures are stored has to be any of the MOLECULE types defined in /etc/chemaxon/types/.
Prerequisites
-
jchem-psql service must run
-
the extension named chemaxon_type must be created.
CREATE
TABLE
table_name (structure_column_name MOLECULE(
'molecule_type_name'
));
molecule_type_name must be specified in order to make the column searchable!
If you will run updates against a table which contains indexed molecule columns,
to reduce postgresql MVCC index entry creation/deletion traffic, postgres should be able to use HOT.
In order to facilitate the usage of HOT on your tables, use the 'fillfactor' storage argument. Read about fillfactor.
Example:
CREATE TABLE t (m molecule('sample')) WITH (fillfactor=50);
Example:
CREATE
TABLE
ttest (mol MOLECULE(
'sample'
));
The created column can be propagated with any of the well-known chemical structural data file formats. Due to postgres type system, the provided string value is automatically converted to MOLECULE type. For importing different molecule formats from local files, see Auxiliary functions for importing into a table.
MOLECULE INDEX
Indextype named chemindex has to be used when indexing a column containing chemical structures.
You can check whether this indextype exists or not:
SELECT
*
FROM
pg_am
WHERE
amname=
'chemindex'
;
Indextype chemindex can be used in the following way:
CREATE
INDEX
index_name
ON
table_name USING chemindex(structure_column_name);
Example:
CREATE
INDEX
ttest_idx
ON
ttest USING chemindex(mol);
Using the index in any type of searches can be forced by setting enable_seqscan parameter to OFF value in postgres configuration file /etc/postgresql/9.4/main/postgresql.conf).
We suggest closing all transactions and running a VACUUM before creating an index to avoid the calculation of the index data on rows that are being changed in the current transaction. E.g. if a column was added in the current transaction, the index creation time can be almost double of the normal time.
The same logic applies to updating a large number of molecules in an indexed table and then search on the table without closing the transaction and clearing up modification with VACUUM.
MOLECULE IMPORT
All well known chemical structural data file formats are supported, but we provide functions for importing mol, sdf files.
In case of big files, the postgresql client can run out of memory as the following error message displays:
ERROR: out of memory
DETAIL: Cannot enlarge string buffer containing
0
bytes by
1506367917
more bytes
Please, split up the original input file into smaller pieces and import them one after another. (Here is a script tool provided for splitting SD files.)
Import from SD file
Import from local SD/mol files containing added fields besides the chemical structure data
The content of all the added fields in the SD file will be stored in a hashmap like type column.Optionally selected added fields from the SD file can be stored in separate columns as well.
Follow the next steps:
-
Set the content of the SD file to a variable (e. g.: content)
\set variable_name `cat sdf_file_name`
Example:
\set content `cat ~/a.sdf`
You can check the table generated by the parse_sdf function
SELECT
*
FROM
parse_sdf(:
'content'
);
The parse_sdf function returns the content of the sdf file as a table, which has two columns:
molSrc = the full sdf source of the original entry in the sdf file
props = hstore type column, which contains the properties of the sdf entry
You can also create a table similar to the table created by the parse_sdf function by:
CREATE
TABLE
tableName (mol MOLECULE(
'sample'
), p hstore);
-
Create a table with create table as:
CREATE
TABLE
table_name
AS
SELECT
molSrc::molecule(
'molecule_type_name'
)
AS
structure_column_name[, props ->
'sdf_field_name'
AS
column_name]
FROM
parse_sdf(:
''
);
Example:
CREATE
TABLE
mysdftable
AS
SELECT
molSrc::molecule(
'sample'
)
AS
mol,
props ->
'MOLFORMULA'
AS
formula,
CAST
(props ->
'CDID'
AS
INTEGER
)
AS
id
FROM
parse_sdf(:
'content'
);
where
mysdftable = the name of the table
mol = the name of the column for storing structural data
molecule('sample') = the type of column mol
sample = molecule type (must be a defined type in /etc/chemaxon/types/ folder)
MOLFORMULA = the name of the field in the SD file for storing chemical formula
formula = the name of the column for storing chemical formula
CDID = the name of the field in the SD file storing the identifier
id = the name of the column for storing the identifier integer
You can also store all properties from the SD file entry in a separate column in the structure table, and later, optionally, add additional columns to separately store selected properties.
Example:
\
set
content `cat ~/a.sdf`
CREATE
TABLE
mytable
AS
SELECT
molSrc::molecule(
'sample'
)
AS
mol,
props
AS
properties
FROM
parse_sdf(:
'content'
);
ALTER
TABLE
mytable
ADD
COLUMN
formula TEXT;
UPDATE
mytable
SET
formula = properties ->
'MOLFORMULA'
Collect invalid molecules
Available from version 1.8.
Running the steps below, the molecules in SD/mol files will be checked during the import and the invalid molecules will be stored in a separate table named <new_table_name>_error.
Steps:
-
Run the import_sdf.sql script to create function import_sdf. This script file is a customizable tool; you can update it according to your needs.
\i import_sdf.sql;
-
Run the following commands including import_sdf function with the appropriate parameters:
\
set
<variable_name> `cat <path_to_your_sdf_file>`;
SELECT
import_sdf(:
'<variable_name>'
,
'<new_table_name>'
,
'<molecule_type>'
);
Example:
\
set
content `cat ~/a.sdf`;
SELECT
import_sdf(:
'content'
,
'molecules'
,
'sample'
);
Import from (cx)smiles or (cx)smarts files
Import from (cx)smiles or (cx)smarts files by the standard copy sql function
COPY table_name (structure_column_name)
FROM
'file_name'
(FORMAT csv);
Example:
COPY ttest(mol)
FROM
'/home/posgresuser/targetfiles/nci-pubchem_1m_unique.smiles'
(FORMAT csv);
Import from(cx)smiles or (cx)smarts files while collecting the invalid molecules
Available from version 1.8
Running the steps below, the molecules in smiles/cxsmiles/smarts/cxsmarts files will be checked during the import and the invalid molecules will be stored in a separate table named <new_table_name>_error . We propose the following two methods: PL/pgSQL script method and SQL commands method.
PL/pgSQL script method
We suggest applying this script method when the server and the client are on the same machine; or at least the molecule files are available on the server and their absolute path is known. Otherwise, apply the steps of SQL commands method.
Steps:
-
Run the import_single_line_format.sql script to create function import_single_line_format. This script file is a customizable tool; you can update it according to your needs.
\i import_single_line_format.sql;
-
Run the following SELECT using import_single_line_format function with the appropriate parameters:
SELECT import_single_line_format(
'<absolute_path_to_my_file_on_server>'
,
'<new_table_name>'
,
'<molecule_type>'
);
Example:
SELECT
import_single_line_format(
'/home/myuser/molecules.smiles'
,
'molecules'
,
'sample'
);
SQL commands method
-
Create a table which will contain the valid molecules
CREATE
TABLE
my_table(mol TEXT);
-
Import from file
\COPY my_table
FROM
'~/molecules.smiles'
(FORMAT csv);
-
Create and load a table for the invalid molecule sources
CREATE
TABLE
my_table_error
AS
SELECT
mol
FROM
my_table
WHERE
NOT
is_valid_molecule(mol);
-
Remove invalid molecules from my_table
DELETE
FROM
my_table
WHERE
mol
IN
(
SELECT
mol
FROM
my_table_error);
-
Convert the valid molecule sources into molecule
ALTER
TABLE
my_table
ALTER
COLUMN
mol TYPE molecule(
'sample'
) USING mol::molecule(
'sample'
);
Convert molecule source text to molecule from an existing table
Follow the steps (starting from step 3) of the SQL commands method above.
MOLECULE FUNCTIONS
Interpreting a string using a specific format
As molecule strings are ambiguous in some cases, it is possible to interpret the given molecule string according to the given molecule format using the molecule(String, String) function:
molecule(structure_string, molecule_format)
where
structure_string = molecule string representation
molecule_format = molecule format string, see file formats
SELECT
*
FROM
table_name
WHERE
molecule(structure_string, molecule_format) |<| column_name;
A typical usecase is the differentiation between SMILES and SMARTS strings:
-
'CC' with SMILES notation is interpreted as two aromatic or aliphatic carbon atoms connected with a single bond.
-
'CC' with SMARTS notation is interpreted as two aliphatic carbon atoms connected with a single bond.
Example:
SELECT
*
FROM
ttest
WHERE
molecule(
'CC'
,
'smiles'
) |<| mol;
SELECT
*
FROM
ttest
WHERE
molecule(
'CC'
,
'smarts'
) |<| mol;
Transformations
Transformation function is provided to transform the molecule structure according to specific needs. Multiple transformation strings can be provided separated by two dots.
query_transform(molecule,
'transformation_string1..transformation_string2'
)
where
molecule = "molecule string" | Molecule object | table column
transformation string = "fullfragment" | "dbsmarkedonly"
Molecule transformation combined with substructure search:
SELECT
*
FROM
ttest
WHERE
query_transform(molecule,
'transformation string'
) |<| mol;
-
Double bond stereo option: dbsmarkedonly
By default, CIS query matches only with CIS target and TRANS query matches only with TRANS target. If matching of CIS query with both CIS and TRANS targets or matching of TRANS query with both CIS and TRANS targets is aimed, then the query molecule has to be transformed.
The query_transform function with dbsmarkedonly option is provided to accomplish this transformation.
query_transform(
'query_structure'
,
'dbsmarkedonly'
)
Examples:
SELECT
*
FROM
ttest
WHERE
query_transform(
'C\C=C\C'
,
'dbsmarkedonly'
) |<| mol;
SELECT
*
FROM
ttest
WHERE
query_transform(
'C\C=C\C'
,
'fullfragment..dbsmarkedonly'
) |<| mol;
-
Full fragment (exact fragment) search option: fullfragment
Full fragment search can be executed by transforming the query structure in a way that it matches only a full fragment of the target structure and by executing a substructure search on this modified query structure. The transformation adds 's*' query search property to all atoms.
query_transform(query_structure,
'fullfragment'
)
Examples:
SELECT
*
FROM
ttest
WHERE
query_transform(
'C1CCCCC1'
,
'fullfragment'
) |<| mol;
Fingerprint generation
Fingerprints can be generated using the following function:
fingerprint(chemical_structure,size)
or
fingerprint(structure_column_name,size)
where size is the number of bits in the fingerprint bit string; it must be divisible by 32, e.g.: 512.
Example:
SELECT
fingerprint(
'C1CCCCC1'
,512);
Chemical terms
(Available since version 1.4.)
Calculation of chemical terms
Function chemterm makes possible to calculate chemical terms.
You may need to purchase separate licenses to apply function chemterm depending on the Chemical Terms used.
chemterm(
'chemical_term'
,
'structure'
)
where
structure = a molecule string
chemical_term = a chemical term function
The output of function chemterm is one of the following formats:
-
string
-
molecule string in format mrv
Examples:
SELECT
chemterm(
'name'
,
'CCO'
);
SELECT
chemterm(
'mass()'
,
'CCO'
);
SELECT
molconvert(chemterm(
'canonicalTautomer()'
,
'CC(O)=C'
)::Molecule,
'smiles'
);
Addition of chemical term columns
Chemical term columns can be added by using triggers.
See the following code block as an example:
CREATE
TABLE
test(structure MOLECULE(
'sample'
), molweight
NUMERIC
);
CREATE
OR
REPLACE
FUNCTION
set_molweight()
RETURNS
trigger
AS
$BODY$
BEGIN
IF NEW.structure
is
null
THEN
NEW.molweight:=
NULL
;
END
IF;
NEW.molweight:=chemterm(
'mass()'
,NEW.structure)::
real
;
RETURN
NEW;
END
;
$BODY$
LANGUAGE plpgsql;
DROP
TRIGGER
IF EXISTS tr_molweight
ON
test;
CREATE
TRIGGER
tr_molweight BEFORE
INSERT
OR
UPDATE
ON
test
FOR
EACH ROW
EXECUTE
PROCEDURE
set_molweight();
UPDATE
test
SET
molweight=chemterm(
'mass()'
,structure)::
real
;
Molconvert
The use of ChemAxon's MolConverter is supported with some limitations:
molconvert(
'structure'
,
'format'
)
where
structure = a Molecule in any of the following formats
format = mrv, mol, rgf, sdf, rdf, csmol, csrgf, cssdf, csrdf, cml, smiles, cxsmiles, abbrevgroup, sybyl, mol2, pdb, xyz, inchi, or name
Example:
SELECT
molconvert(
'CC'
,
'sdf'
);
Molecule validation
Available from version 1.8.
The following function is provided to check the validity of the molecules:
is_valid_molecule(structure_text)
where
structure_text = a Molecule in any of the following formats:
mrv, mol, rgf, sdf, rdf, csmol, csrgf, cssdf, csrdf, cml, smiles, cxsmiles, abbrevgroup, sybyl, mol2, pdb, xyz, inchi, or name
Example:
SELECT
is_valid_molecule(
'CCO'
);
You can filter out the invalid structures from a table using the following SELECT statement:
SELECT
structure_text
from
mytable
where
is_valid_molecule(structure_text) =
'f'
;
MOLECULE OPERATORS
The main features of the different search types are available in JChem Query Guide.
The molecule operators work also in tables without chemical index, however, chemical index makes the search operations much faster.
Substructure search
Substructure search is performed using the symmetrical sub-/super-structure search operator: |<|.
SELECT
*
FROM
table_name
WHERE
query_structure |<| structure_column_name;
SELECT
*
FROM
table_name
WHERE
structure_column_name |>| query_structure;
where
query_mol = "molecule string" | Molecule object | table column
target_mol = "molecule string" | Molecule object | table column
Examples:
SELECT
'[#6]-[#6]'
|<|
'CC'
::Molecule(
'sample'
);
SELECT
*
FROM
ttest
WHERE
'c1ccccc1'
|<| mol;
SELECT
*
FROM
ttest
WHERE
'testmol
Mrv0541 01211514572D
3 2 0 0 0 0 999 V2000
1.2375 -0.7145 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9520 -1.1270 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6664 -0.7145 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
M END
'
|<| mol;
See also molecule functions about defining the query structure format.
Superstructure search
Superstructure search is performed using the sub-/super-structure search operator: |<|.
SELECT
*
FROM
table_name
WHERE
query_structure |>| structure_column_name;
SELECT
*
FROM
table_name
WHERE
structure_column_name |<| query_structure;
Example:
SELECT
*
FROM
ttest
WHERE
'CCC'
|>| mol;
Full fragment (exact fragment) search
Full fragment search can be executed by transforming the query structure in a way that it matches only a full fragment of the target structure and by executing a substructure search on this modified query structure.
See details in transformations for full fragment search.
SELECT
*
FROM
table_name
WHERE
query_transform(query_structure,
'fullfragment'
) |<| structure_column_name;
Example:
SELECT
*
FROM
ttest
WHERE
query_transform(
'CC'
,
'fullfragment'
) |<| mol;
Duplicate search
Duplicate search is performed using the |=| operator.
SELECT
*
FROM
table_name
WHERE
query_structure |=| structure_column_name;
SELECT
*
FROM
table_name
WHERE
structure_column_name |=| query_structure;
Example:
SELECT
*
FROM
ttest
WHERE
'CCC'
|=| mol;
Tautomer search
You have to apply such a molecule type which has tautomer = GENERIC tautomer mode.
How is chemical matching of the query and the target executed in tautomer search? The generic tautomer - representing all theoretically possible tautomers - of the target is matched with the query structure itself. This method is applied in substructure search, full fragment search, duplicate, and superstructure search.
Limitations:
-
SMARTS atoms and SMARTS bonds in the query structures are not supported.
-
The use of bond lists in query structures may slow down the search.
Similarity search
For similarity searches you must have a column filled with fingerprints. These fingerprints can correspond to molecules in another table (method 1) or to molecules in the same table (method 2). We strongly advise using the other table method (method 1) because of the performance. The fingerprint data are usually small and PostgreSQL database engine can store them in memory.
Fingerprints can be generated as described above.
It is advised to name the fingerprint column as fp - in order to make similarity search easily runnable from applications based on JChem PostgreSLQ Cartridge API.
The following statements should be used for running similarity search using method 1:
CREATE
TABLE
molecule_table_name(structure_column_name MOLECULE(
'molecule_type_name'
), id
INTEGER
);
CREATE
INDEX
molecule_table_idx
ON
molecule_table_name(id);
CREATE
TABLE
fingerprint_table_name
AS
SELECT
fingerprint(structure_column_name,512) fp, id
FROM
molecule_table_name;
CREATE
INDEX
fingerprint_idx
ON
fingerprint_table_name(fp);
CREATE
INDEX
fingerprint_id_idx
ON
fingerprint_table_name(id);
SELECT
*
FROM
fingerprint_table_name
WHERE
tanimoto(fingerprint(query_structure,512),fp) operator similarity_value;
SELECT
*
FROM
(
SELECT
ft.*, tanimoto(fingerprint(query_structure,512),fp)
AS
tanimoto
FROM
fingerprint_table_name
AS
ft)
WHERE
tanimoto operator similarity_value
ORDER
by
tanimoto
DESC
;
where
operator can be <, <=, =, >, >=
similarity_value is a number between 0 and 1
fp is the recommended name of the fingerprint column
Example:
CREATE TABLE moltable(mol MOLECULE(
'sample'
), id INTEGER);
CREATE INDEX moltable_idx ON moltable(id);
Insert molecules into moltable and/or create a trigger, then continue with the followings:
CREATE
TABLE
fptable
AS
SELECT
fingerprint(mol,512) fp, id
FROM
moltable;
CREATE
INDEX
fptable_idx
ON
fptable(fp);
CREATE
INDEX
fptable_id_idx
ON
fptable(id);
SELECT
*
FROM
fptable
WHERE
tanimoto(fingerprint(
'CCC'
,512),fp) > 0.9;
--to get the results in descending order by similarity
SELECT
t.id,t.tanimoto, moltable.mol
FROM
(
SELECT
fptable.*, tanimoto(fingerprint(
'CCC'
,512),fp)
AS
tanimoto
FROM
fptable)
AS
t, moltable
WHERE
t.id = moltable.id
AND
tanimoto > 0.9
ORDER
BY
tanimoto
DESC
;
The following statements should be used for running similarity search using method 2:
CREATE
TABLE
table_name(structure_column_name MOLECULE(
'molecule_type_name'
));
ALTER
TABLE
table_name
ADD
COLUMN
fp BYTEA;
UPDATE
table_name
SET
fp = fingerprint(structure_column_name,512);
SELECT
*
FROM
table_name
WHERE
tanimoto(fingerprint(query_structure,512),fp) operator similarity_value;
Example:
CREATE
TABLE
simtable(mol MOLECULE(
'sample'
));
ALTER
TABLE
simtable
ADD
COLUMN
fp BYTEA;
UPDATE
simtable
SET
fp=fingerprint(mol,512);
SELECT
*
FROM
simtable
WHERE
tanimoto(fingerprint(
'CCC'
,512),fp) > 0.9;
Please note that fingerprint values present in fingerprint columns must be recalculated when molecules are updated.
Known Issue
Update of tables with more than about 1000 records might be very slow.
You can create a trigger to update the fingerprint column in the fingerprint table (fptable) when new records are inserted into the molecule table (moltable).
Example:
CREATE TABLE moltable(mol MOLECULE(
'sample'
), id INTEGER);
CREATE TABLE fptable AS SELECT fingerprint(mol,
512
) fp, id FROM moltable;
CREATE OR REPLACE FUNCTION set_fingerprint()
RETURNS trigger AS
$BODY$
BEGIN
IF (TG_OP =
'DELETE'
) THEN
DELETE FROM fptable WHERE id = OLD.id;
RETURN OLD;
END IF;
IF (TG_OP =
'UPDATE'
) THEN
IF NEW.mol is
null
THEN
UPDATE fptable SET id = NEW.id, fp =
null
where id = OLD.id;
END IF;
UPDATE fptable SET id = NEW.id, fp = fingerprint(NEW.mol,
512
) where id = OLD.id;
RETURN NEW;
END IF;
IF (TG_OP =
'INSERT'
) THEN
IF NEW.mol is
null
THEN
INSERT INTO fptable (fp, id) VALUES (
null
,NEW.id);
END IF;
INSERT INTO fptable (fp, id) VALUES (fingerprint(NEW.mol,
512
),NEW.id);
RETURN NEW;
END IF;
END;
$BODY$
LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS tr_fingerprint ON moltable;
CREATE TRIGGER tr_fingerprint BEFORE INSERT OR UPDATE OR DELETE ON moltable
FOR EACH ROW EXECUTE PROCEDURE set_fingerprint();
Combination of structure query AND structure/non-structure query
-
WHERE condition referring to more than one column containing chemical structure data is not supported.
-
WHERE condition referring to one structure column and one or more columns containing non-chemical structure data is supported.
Relevance sorting
Available from version 1.8.
As in standard SQL, the user can order his results using ORDER BY commands.
For ordering search results, JChem PostgreSQL function relevance(Molecule) is provided, which gives back a numeric type value based on the atom counts and further topological features of the molecule.
It is suggested that relevance values be stored in the table for further query. It is also suggested that an index is created on the relevance column and further queries return their results ordered by the relevance value. This facilitates the usage of LIMIT <n> conditions as the most relevant hits are at the beginning of the result set.
ALTER
TABLE
<mytable>
ADD
COLUMN
<relevance_column>
INT
;
UPDATE
<mytable>
SET
<relevance_column> = relevance(mol)::
int
;
CREATE
INDEX
<relevance_index>
on
<mytable>(<relevance_column>);
SELECT
mol
FROM
<mytable>
WHERE
'query_structure'
|<| mol
ORDER
BY
<relevance_column> LIMIT <n>;
Example (assuming <mytable> has column "mol" of type Molecule):
ALTER
TABLE
test
ADD
COLUMN
relev
INT
;
UPDATE
TEST
SET
relev = relevance(mol)::
int
;
CREATE
INDEX
relev_idx
on
test(relev);
SELECT
mol
FROM
test
WHERE
'c1ccccc1'
|<| mol
ORDER
BY
relev LIMIT 100;
Other chemical features/measures may also be used for ordering, the chemterm function provides help for their definition.
You can create a trigger to update the relevance column when new records are inserted into the table.
Example:
CREATE TABLE test (mol Molecule(
'sample'
));
ALTER TABLE test ADD COLUMN relev INT;
CREATE OR REPLACE FUNCTION set_relevance()
RETURNS trigger AS
$BODY$
BEGIN
IF NEW.mol is
null
THEN
NEW.relev:=NULL;
END IF;
NEW.relev:=relevance(NEW.mol)::
int
;
RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS tr_relevance ON test;
CREATE TRIGGER tr_relevance BEFORE INSERT OR UPDATE ON test
FOR EACH ROW EXECUTE PROCEDURE set_relevance();
UPDATE test SET relev=relevance(mol)::
int
;
AUXILIARY FUNCTIONS
Debugging
Raising the log level of the psql-client for debugging purposes:
SET
client_min_messages
to
debug;
Performance log
You can log the performance of the current session as:
SELECT
*
FROM
perf_out();
You can clear the log as (available since version 1.4):
SELECT
perf_reset();
Performance tuning of combined queries
Available since version 1.4.
The performance of searching with combined queries - when chemical structure query condition is combined with a non-structure query condition - might be improved by executing the following calibration steps.
-
Prerequisite: the column structure_column_name in table table_name used for the calibration must be indexed using indextype chemindex.
-
Calibration of cost factors
select
calibrate_cost_factors(
'table_name'
,
'structure_column_name'
,
'query_structure'
);
where
query_structure is an optional parameter. If not specified, chlorobenzene (Clc1ccccc1) is used by default.
The applied query_structure is an important key factor of the calibration. The best specified query_structure has almost the same (+/- 10%) estimated selectivity in the given table as the number of its search hits. We recommend the next explain statement to run in order to get the estimated selectivity. The number of rows given in the output of the next statement is the estimated selectivity.
explain
select
*
from
table_name
where
'query_structure'
|<| structure_column_name;
In order to get better performance, you may need to increase the default_statistics_target value for PostgreSQL analyzer. Its default value 100 can be increased to maximum 10000.
-
Application of the calibrated cost factors
Please follow the suggested solutions given in the output of the select calibrate_cost_factors statement.Example:
To set the values permanently, add the lines
chemaxon.cost_slope_factor = -5.14e-07
chemaxon.cost_intercept_factor = 0.03866
to the PostgreSQL configuration file (e.g.: /etc/postgresql/9.4/main/postgresql.conf)
To set values only for the current session, execute the following commands:
SET chemaxon.cost_slope_factor = -
5
.14e-
07
;
SET chemaxon.cost_intercept_factor =
0.03866
;
Performance tuning of searches in big tables
Available since version 1.8
Searching in big table can be a very time consuming operation if the available memory cannot store all the necessary data of the whole table. The memory management operations could increase the search times.
Performance of searches in big tables can be tuned by applying the following JVM heap size and PostgreSQL settings, furthermore, the SELECT statements used for querying can be refined.
-
JVM heap size settings
In order to ensure that the biggest table fits into the memory, the approximate minimum heap size can be calculated as:heap size > <number of rows in the biggest table in millions> *
800
MB +
700
MB
Example on a table of 5 million rows:
heap size >
4700
MB
Heap size can be set by specifying the Xmx parameter in the JCHEM_PSQL_OPTS variable of the /etc/default/jchem-psql properties file.
-
PostgreSQL settings
If the user would like to execute queries-
on big tables - containing many entries and big row data. E.g., the molecule source is a verbose one (sdf, mrv, ...)
-
where psql needs to fetch majority of the rows. E.g., sql query without any additional restricting condition or limit parameter
then PostgreSQL needs to be able to fetch all rows rapidly requiring large memory because of the big table. In order to achieve this, set the shared_buffers parameter in postgresql.conf to be able to store your table.
Table size can be checked by \dt+ <tablename> command from the psql client.
Additionally, to enhance PostgreSQL performance, it may be advisable to increase linux's shared memory with the following command:
sysctl -w kernel.shmall = <shared memory size in bytes>/<page size>
The default value of page size is usually 4096. It can be checked by getconf PAGESIZE.To store the value of kernel.shmall permanently, add it it to the sysctl.conf file. For more details about required shared memory settings for PostgreSQL server please visit the PostgreSQL documentation.
To reduce the needed buffer size you may consider the following possibilities.
-
Change the input format to a concise one: smiles, smarts, ...
(e.g., 8M rows of a PubChem dataset need ca. 17 GB when the molecules are in sdf format, but only ca. 1.4 GB when they are in smiles format). -
Avoid queries that require the fetching of all table data through adding additional "where" condition or limit parameter.
For example, instead of
SELECT
<id_column_name>
FROM
<table_name>
WHERE
'c1ccccc1'
|<| mol;
which may return several millions of hits, use the following statement applying LIMIT <n> and relevance sorting in order to obtain the most relevant n hits:
SELECT
<id_column_name>
FROM
<table_name>
WHERE
'c1ccccc1'
|<| mol
ORDER
BY
relevance LIMIT 100;
-
Increase the fillfactor to 90% instead of the above recommended 50% and perform vacuum regularly.
-
As PostgreSQL is not optimized for the COUNT() method, it may take a long time if it returns large value. For obtaining an estimation of the count of hits we rather suggest using the EXPLAIN command.
Please also check that the JVM heap size plus the PostgreSQL shared buffer size is not more than 2/3-rd of the total available memory to avoid slow-down of the operation system.
-
-
Troubleshooting
If searches on big tables are slow then probably the previously mentioned memory parameters need to be adjusted properly. However if searches on small tables are slow even after repeated execution, it may be a result of too large cache of the jchem-psql service.
The cache size is estimated automatically based on heap size and based on average size molecules. It may cause too much heap consumption for big molecules. In such cases, it is suggested to define the two cached object count parameters manually in /etc/chemaxon/jchem-psql.conf. Set them to a value smaller than Xmx [in GB] * 1000000.