Document to Structure Developer's Guide - Document Extractor
Introduction
The DocumentExtractor class extracts chemical names from text documents and converts them to chemical structures.
Basic API usage
Example usage:
// We have a document to process
java.io.Reader document = ...;
DocumentExtractor x = new DocumentExtractor();
x.processHTML(document); // or processPlainText(document) for input in plain text format
// Iterate through the hits
for (Hit hit : x.getHits()) {
    System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
}
The field hit.position contains the position of the first character of the name in the document.
Note that hit.text contains the name exactly as it appears in the source document. A cleaned version, with possible OCR errors, typos, and similar defects corrected, can be retrieved with hit.structure.getName().
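As an illustration of how hit.position and hit.text fit together, the following minimal sketch shows a hit with a few characters of surrounding context. It does not use the Document to Structure library: SimpleHit, HitContext and snippet are hypothetical names introduced only for this example. It also assumes that the position is an integer character offset into the same string that was processed, which is most likely to hold for plain text input.

```java
public class HitContext {

    // Hypothetical stand-in for the two hit fields used here (not the library's Hit class).
    static final class SimpleHit {
        final int position;   // assumed: offset of the first character of the name
        final String text;    // the name as it appears in the document

        SimpleHit(int position, String text) {
            this.position = position;
            this.text = text;
        }
    }

    // Returns the name wrapped in square brackets, with up to `margin`
    // characters of the document on each side.
    static String snippet(String document, SimpleHit hit, int margin) {
        int start = hit.position;
        int end = start + hit.text.length();
        // Check that the reported position really points at the reported text.
        if (start < 0 || end > document.length() || !document.startsWith(hit.text, start)) {
            throw new IllegalArgumentException(
                    "hit does not match the document at position " + start);
        }
        int from = Math.max(0, start - margin);
        int to = Math.min(document.length(), end + margin);
        return document.substring(from, start)
                + "[" + hit.text + "]"
                + document.substring(end, to);
    }

    public static void main(String[] args) {
        String document = "Dissolve aspirin in ethanol.";
        SimpleHit hit = new SimpleHit(document.indexOf("aspirin"), "aspirin");
        System.out.println(snippet(document, hit, 5));
    }
}
```

With real hits, the same idea would apply by passing hit.position and hit.text from the loop above, provided the offset assumption holds for the input format used.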
This class can also be run from the command line. It then expects the name of a plain text file as the first argument; when no argument is given, it reads the text from standard input. The list of hits is printed to standard output.
See also
- Detailed code examples using Document to Structure in real-world situations.