Document to Structure Developer's Guide

Introduction

The Document to Structure product finds chemical structures in documents. It supports chemical names in the text of a document, structures embedded in Office documents, and images of drawn structures (see the user documentation for more details). The structures can then be exported to any supported molecule format, or manipulated in memory.

Basic API usage

Document to Structure plugs into the generic IO API of ChemAxon. This means that a document can be used as a source for importing structures, exactly like any other molecular format (sdf, for example).

Example usage:

// We have a document to process
File document = new File("document.pdf");

MolImporter importer = new MolImporter(document, "d2s");

// Iterate through the hits
for (Molecule m : importer) {
    String smiles = MolExporter.exportToFormat(m, "smiles");
    String name = m.getName();
    String sourceText = m.getProperty(DocumentToStructure.SOURCE_TEXT);
    //...
}

The exact same code can be used to import an XML file, a Microsoft Office document, and other supported document types. The format is detected automatically.

The list of all available properties can be found in the API documentation. Which properties are available depends on the format. For instance, in text formats like xml, html and txt, the number of characters since the beginning of the file is available as DocumentToStructure.CHARACTER, while this property has no value in a binary format.

Note that SOURCE_TEXT contains the name exactly as it appears in the source document. A cleaned version, with possible OCR errors and typos corrected, can be retrieved with m.getName().

Processing text directly

When the text to convert is given as a String object, the MolImporter object can be constructed with:

  String text = ...;
  MolImporter importer = DocumentToStructure.process(text);

Configuring behavior

Document to Structure accepts options that configure how it behaves. All Name to Structure format options can be used with Document to Structure as well, to configure which name conversions are attempted. For instance, elements and ions are not converted by default when using d2s, as they occur often in documents and are not always useful. However, their conversion can be enabled using:

MolImporter importer = new MolImporter(document, "d2s:+elements,+ions");

Document to Structure has specific options as well:

  • cas: enable the conversion of CAS numbers (uses a webservice, off by default).

  • smiles: enable the conversion of SMILES strings (on by default)

  • inchi: enable the conversion of InChI strings (on by default)

  • ocr: enable the processing of scanned text in PDF documents (on by default)

  • osr: enable the conversion of structure drawings by any available OSR external tool (on by default if any such tool is installed)

  • osra: enable the conversion of structure drawings by the OSRA external tool (on by default if OSRA is installed). Using this option will specify that OSRA should be used even if other OSR tools are available.

  • clide: enable the conversion of structure drawings by the CLiDE external tool (on by default if CLiDE is installed). Using this option will specify that CLiDE should be used even if other OSR tools are available.

  • imago: enable the conversion of structure drawings by the Imago external tool (on by default if Imago is installed). Using this option will specify that Imago should be used even if other OSR tools are available.

  • timeout=N: the maximum number of seconds to run, with 0 for no timeout (default: no timeout)

  • osraTimeout=N: configure the maximum number of seconds to run OSRA on an image (default: 20 seconds)

  • clideTimeout=N: configure the maximum number of seconds to run CLiDE on an image (default: 20 seconds)

  • imagoTimeout=N: configure the maximum number of seconds to run Imago on an image (default: 20 seconds)

  • filterOSR: enable the filtering of OSR structures for incomplete recognition (on by default)

  • text: enable the conversion of all text based formats: name, smiles, InChI, CAS (on by default)

  • acronyms: enable the conversion of acronyms, such as ATP for Adenosine TriPhosphate (off by default)

  • vernacular: enable the conversion of everyday terms like "water" or "steam" (off by default)

  • OLE: enable the conversion of structures embedded in Office documents (on by default)

  • startPage=N: start processing document at page N (can be combined with endPage to process a range of pages)

  • endPage=N: stop processing document at page N

  • insideTag=<tag>: for markup formats, enable the conversion only inside the given tag (typically insideTag=body for HTML). Off by default.

  • contextRadius=N: maximum number of characters of context to include, on each side of the hit (default = 40).

  • contextIndex: whether to include the index of the hit in the context. Off by default.

Each option can be preceded by a minus sign - (for instance -smiles) to disable it. Both forms, smiles and +smiles, are accepted to enable an option.
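The option syntax described above can be assembled programmatically rather than by hand. The following is a minimal sketch, not part of the ChemAxon API: the D2sOptions class and its method names are hypothetical, and it assumes that name=value options are joined with commas in the same way as the +/- flags shown in the "d2s:+elements,+ions" example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the ChemAxon API) that assembles a d2s option string. */
final class D2sOptions {
    private final List<String> parts = new ArrayList<>();

    /** Adds "+name", for instance "+elements". */
    D2sOptions enable(String name) {
        parts.add("+" + name);
        return this;
    }

    /** Adds "-name", for instance "-smiles". */
    D2sOptions disable(String name) {
        parts.add("-" + name);
        return this;
    }

    /** Adds "name=value", for instance "startPage=2" (comma-joining is an assumption). */
    D2sOptions set(String name, int value) {
        parts.add(name + "=" + value);
        return this;
    }

    /** Returns "d2s" when no option was added, otherwise "d2s:" plus the comma-separated options. */
    String build() {
        return parts.isEmpty() ? "d2s" : "d2s:" + String.join(",", parts);
    }
}
```

The resulting string would be passed as the second argument of the MolImporter constructor, for instance new MolImporter(document, new D2sOptions().enable("elements").enable("ions").build()).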

Monitoring progress

To estimate the progress of a document conversion, use the standard method MolImporter.estimateNumRecords().
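One way to turn the estimate into a progress percentage is sketched below. The ImportProgress class is hypothetical and not part of the ChemAxon API. It assumes that the value returned by estimateNumRecords() is an integer estimate that may be too low or even zero, so the result is capped at 100 and a zero estimate is guarded against.

```java
/** Hypothetical helper (not part of the ChemAxon API) that turns a record estimate into a percentage. */
final class ImportProgress {
    private final int estimatedTotal;
    private int processed;

    /** estimatedTotal would typically come from importer.estimateNumRecords() (assumed wiring). */
    ImportProgress(int estimatedTotal) {
        // Treat a zero or negative estimate as 1 to avoid division by zero.
        this.estimatedTotal = Math.max(1, estimatedTotal);
    }

    /** Call once per structure read; returns the updated progress in percent. */
    int recordProcessed() {
        processed++;
        return percent();
    }

    /** Current progress in percent, capped at 100 because the total is only an estimate. */
    int percent() {
        return (int) Math.min(100L, processed * 100L / estimatedTotal);
    }
}
```

In the import loop, recordProcessed() would be called once per Molecule and its return value shown to the user.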

Command line usage

Document to Structure can be used like any other import file format. For instance, it can be run from the command line by calling MolConverter on a file in a format supported by Document to Structure:

  molconvert sdf document.doc -o structures.sdf