Standardization

To ensure that search results are correct, query and target (database) molecules must share a common representation. JChem Standardizer can bring both molecules involved in a search into this common form, based on a configuration XML or an action string.

Aromatization

Aromatization is the most important standardization step for searching. It converts different resonance forms of aromatic systems to a canonical aromatic representation by changing alternating single and double bonds to the aromatic bond type. For the theory of aromaticity detection, see this link.

Aromatization can be performed either by calling the MoleculeGraph.aromatize(true) function directly or through the "aromatize" action of Standardizer. It has to be done manually when you use the MolSearch class, but the high-level StandardizedMolSearch and JChemSearch classes do it automatically when default standardization is used. If you use SMARTS queries and need Daylight compatibility, a Daylight-compatible aromatization may be important for you. In this case see MoleculeGraph.aromatize(int).

Special care has to be taken when you are assembling a custom standardization configuration. In this case the "aromatize" action should be present in the configuration, and it is safest to put it first.

Furthermore, the alternating single and double bond representation of aromaticity should be used with caution in queries. It cannot be used to describe only part of an aromatic ring, because aromatization cannot be performed on a partial ring.
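The effect of aromatization on matching, and the partial-ring caveat, can be imitated with a small toy model. The sketch below is not the JChem algorithm: it assumes a bond path is just an array of bond orders, handles only six-membered rings, and ignores every real aromaticity rule. The class and method names (ToyAromatizer, aromatize) are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;

// Toy model only: a bond path is an array of bond orders.
// This is NOT how JChem detects aromaticity; it merely shows why a
// canonical form makes two Kekule forms comparable.
class ToyAromatizer {
    static final int SINGLE = 1, DOUBLE = 2, AROMATIC = 4;

    // Returns a copy in which a closed six-membered ring with strictly
    // alternating single/double bonds has all bonds set to AROMATIC.
    // Anything else (open fragment, other size, broken alternation)
    // is returned unchanged.
    static int[] aromatize(int[] bonds, boolean closedRing) {
        if (!closedRing || bonds.length != 6) {
            return bonds.clone();
        }
        for (int i = 0; i < bonds.length; i++) {
            int cur = bonds[i];
            int next = bonds[(i + 1) % bonds.length];
            boolean alternates = (cur == SINGLE && next == DOUBLE)
                    || (cur == DOUBLE && next == SINGLE);
            if (!alternates) {
                return bonds.clone();
            }
        }
        int[] result = new int[bonds.length];
        Arrays.fill(result, AROMATIC);
        return result;
    }

    public static void main(String[] args) {
        int[] kekuleA = {SINGLE, DOUBLE, SINGLE, DOUBLE, SINGLE, DOUBLE};
        int[] kekuleB = {DOUBLE, SINGLE, DOUBLE, SINGLE, DOUBLE, SINGLE};
        int[] partial = {DOUBLE, SINGLE, DOUBLE}; // open fragment of a ring

        // The raw Kekule forms differ bond by bond.
        System.out.println(Arrays.equals(kekuleA, kekuleB)); // false
        // After aromatization both forms are identical.
        System.out.println(Arrays.equals(
                aromatize(kekuleA, true), aromatize(kekuleB, true))); // true
        // A partial ring cannot be aromatized and keeps its explicit bonds.
        System.out.println(Arrays.toString(aromatize(partial, false))); // [2, 1, 2]
    }
}
```

In this model the partial fragment keeps its explicit single and double bonds, so it can no longer be compared bond for bond with a target whose ring has been converted to aromatic bonds.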

Table 1.

Image files are located in images/download/attachments/41129154/. Columns are targets, rows are queries.

query \ target | arom001.png | arom002.png | arom003.png | arom004.png
arom003.png    | no          | no          | yes         | yes
arom007.png    | yes         | yes         | no          | no
arom005.png    | yes         | yes         | no          | no
arom006.png    | yes         | yes         | no          | no

Standardization in the database

In JChem databases, standardization is done automatically using the table standardizer configuration.

The database molecules are standardized during structure import into a JChem table (and also during structure update). First, the original source of the chemical structure is stored in the cd_structure field, which can then be used for display and export. The standardized form is then stored in the cd_smiles field in a compact format. This representation is used by the search process. All additional structure-dependent data (fingerprints, molecular weight and formula, Chemical Terms calculated columns) are also calculated from the standardized form. For a JChem index in the Cartridge, this process is done during index creation (and during structure insert/update in an indexed structure column), and the standardized form is stored within the index.

Query structures are standardized automatically before the search. These automatic standardization actions are always the same as the ones assigned to the table in which the search runs, so during a database search the same standardization actions are applied to the query structures as to the target structures. By default, two standardization actions are executed automatically: aromatization and removal of explicit hydrogens. Theoretical background and examples of aromatization can be found here.

There are two types of standardization in the database:

  • Default standardization: By default, the bonds of aromatic systems are replaced with aromatic bonds and explicit hydrogen atoms are transformed to implicit ones when possible. The query standardization actions are described here. The default standardization is adequate in most cases.

  • Custom standardization: In some cases custom standardization is necessary, e.g. when nitro groups in the input structures are stored in different forms. You can define your own standardization rules with a Standardizer configuration (XML or action string). You can specify your custom configuration at table or index creation. Custom standardization requires a Standardizer license. This demo shows standardization in Instant JChem.

Standardization of query structures

To ensure that searches with queries containing explicit hydrogens give correct results, from JChem 5.0 the specified RemoveExplicitH actions are not performed on the query in substructure, full fragment and full searches, or on structures inserted into query tables. (The standardizer action attributes used for this purpose in previous JChem versions, namely the "optional" flag and the "query" and "target" groups, are not necessary for RemoveExplicitH actions from JChem 5.0 and are deprecated, but these attributes will still work the same way as before for the time being.)
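The rule above can be pictured as a filter over the table's action list. The following is a rough sketch under stated assumptions, not JChem code: actions are represented only by their element names, the QueryActionFilter class and its SearchType enum are hypothetical, and OTHER stands for any search type the text does not list. The case of structures inserted into query tables is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the rule described above; not part of the JChem API.
class QueryActionFilter {
    // OTHER is a placeholder for search types not named in the text.
    enum SearchType { SUBSTRUCTURE, FULL_FRAGMENT, FULL, OTHER }

    // Returns the actions that would be applied to the query structure.
    // RemoveExplicitH is skipped for the three search types listed in the text.
    static List<String> queryActions(List<String> tableActions, SearchType type) {
        boolean keepExplicitH = type == SearchType.SUBSTRUCTURE
                || type == SearchType.FULL_FRAGMENT
                || type == SearchType.FULL;
        List<String> result = new ArrayList<>();
        for (String action : tableActions) {
            if (keepExplicitH && action.equals("RemoveExplicitH")) {
                continue; // explicit hydrogens of the query are preserved
            }
            result.add(action);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> defaults = List.of("Aromatize", "RemoveExplicitH");
        System.out.println(queryActions(defaults, SearchType.SUBSTRUCTURE)); // [Aromatize]
        System.out.println(queryActions(defaults, SearchType.OTHER)); // [Aromatize, RemoveExplicitH]
    }
}
```

The target structures are unaffected by this filter: in the model, as in the text, only the query side skips the hydrogen removal.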

Configuration file of default standardization

The following configuration file describes the default standardization:

<!-- Standardizer configuration file -->
<StandardizerConfiguration>
    <Actions>
        <Aromatize ID="aromatize"/>
        <RemoveExplicitH ID="dehydrogenize"/>
    </Actions>
</StandardizerConfiguration>

The same in a short action string format is: "dehydrogenize..aromatize".
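As the example string suggests, an action string is a list of action names joined by "..". The sketch below only splits such a string into its parts so the individual actions can be inspected. It assumes ".." is the sole separator and does not handle any parameter syntax an action may carry. The ActionStringSketch class and its splitActions method are hypothetical helpers, not part of the Standardizer API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting an action string; not a Standardizer class.
class ActionStringSketch {
    // Splits an action string on the literal ".." separator.
    static List<String> splitActions(String actionString) {
        return Arrays.asList(actionString.split(Pattern.quote("..")));
    }

    public static void main(String[] args) {
        // The default standardization in action string form, as given above.
        System.out.println(splitActions("dehydrogenize..aromatize"));
        // prints [dehydrogenize, aromatize]
    }
}
```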
The following figure illustrates the Standardizer configuration builder and an example standardization transformation.

images/download/attachments/41129154/standardization.png