Document to Structure Developer's Guide - Document Extractor

Introduction

The DocumentExtractor class extracts chemical names from text documents and converts them to chemical structures.

Basic API usage

Example usage:

// We have a document to process
java.io.Reader document = ...;

DocumentExtractor x = new DocumentExtractor();
x.processHTML(document); // or processPlainText(document) for input in plain text format

// Iterate through the hits
for (Hit hit : x.getHits()) {
    System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
}

The field hit.position contains the position of the first character of the name in the document.

Note that hit.text contains the name exactly as it appears in the source document. A cleaned version of the name, with possible OCR errors, typos and similar defects corrected, can be retrieved with hit.structure.getName().
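The following is a minimal sketch of how hit.position and hit.text could be mapped back to the source text, for example to show a name with some surrounding context. It does not use the DocumentExtractor API: the HitContext class and its methods are hypothetical helpers, and the plain int and String parameters stand in for the fields of a Hit. It also assumes that the position is a zero-based character offset into the same string that was processed, which this guide does not state explicitly.

```java
// Sketch only: HitContext is a hypothetical helper, not part of the library.
// Assumption: the position is a zero-based character offset into the text.
final class HitContext {

    /** Returns the hit text with up to `radius` characters of context on each side. */
    static String around(String documentText, int position, String hitText, int radius) {
        if (position < 0 || position + hitText.length() > documentText.length()) {
            throw new IllegalArgumentException("hit lies outside the document");
        }
        int start = Math.max(0, position - radius);
        int end = Math.min(documentText.length(), position + hitText.length() + radius);
        return documentText.substring(start, end);
    }

    /** Checks that the document contains hitText at the given offset. */
    static boolean matches(String documentText, int position, String hitText) {
        return position >= 0 && documentText.startsWith(hitText, position);
    }

    public static void main(String[] args) {
        String doc = "Add 5 g of benzene to the flask.";
        int position = doc.indexOf("benzene"); // stands in for hit.position
        String text = "benzene";               // stands in for hit.text

        System.out.println(matches(doc, position, text));
        System.out.println(around(doc, position, text, 6));
    }
}
```

A check like matches() is a cheap way to verify the offset assumption against real output before relying on it, for instance when highlighting names in the original document.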

This class can also be run from the command line. It expects the name of a plain text file as its first argument; when no argument is given, it reads the text from standard input. The list of hits is printed to standard output.
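The same input convention (first argument as a file name, standard input otherwise) can be mirrored in your own wrapper code. The sketch below shows only the input selection, using standard Java I/O. The InputSelector class is a hypothetical name, and UTF-8 decoding is an assumption: this guide does not say which character encoding the command-line tool uses.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch only: InputSelector is a hypothetical helper, not part of the library.
final class InputSelector {

    /** Opens the file named by the first argument, or standard input if there is none. */
    static Reader open(String[] args) throws IOException {
        if (args.length > 0) {
            // Assumption: the file is UTF-8 encoded.
            return Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
        }
        return new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        try (Reader document = open(args)) {
            // In real code the reader would be handed to processPlainText(document).
            // Here we only count characters to show that the input was opened.
            char[] buffer = new char[4096];
            long total = 0;
            int n;
            while ((n = document.read(buffer)) != -1) {
                total += n;
            }
            System.out.println("Read " + total + " characters");
        }
    }
}
```

Returning a java.io.Reader keeps the helper compatible with the Reader-based calls shown in the example usage above.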

See also