Document to Database Administration Guide

Architecture

A typical system will consist of three server computers:

  • a document repository (Documentum)

  • a crawler (Document to Database)

  • a database server (for instance Oracle with the JChem cartridge)

The document repository is assumed to be installed and working already. It is not needed when indexing documents from a filesystem.

The database server must likewise be installed and running.

This documentation covers the installation and setup of the crawler server. We recommend dedicating a Linux machine to this task.

Installation

  1. Download d2db.zip

  2. Log on to the crawler machine as the user that will run d2db.

  3. unzip d2db.zip

  4. cd d2db

  5. cp -a conf.sample conf

You are now ready to start the configuration.

Configuration

The conf directory contains all configuration files. You need to edit at least the d2db.conf file, which contains an example configuration and comments for each option.

If you are using Documentum as a document repository, you also need to edit documentum/dfc.properties to configure access to the Documentum server (host, username and password).

Commands

Document to Database command-line actions all have the following form:

./d2db <command> <parameters...>

Initialization

At this point you should be ready to run the first d2db command, which initializes the database by creating the necessary tables. If you created the database schema yourself and only want d2db to populate it, set d2db.fixedSchema = true in the schema.conf configuration file and skip this section.

To create the database tables, run this command once:

./d2db create
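
A setup script may need to decide whether to run create at all. The following is a minimal sketch, assuming that schema.conf sits in the conf directory, uses plain "key = value" lines as in the option quoted above, and treats lines starting with "#" as comments. The function names schema_is_fixed and create_if_needed are illustrative and not part of d2db.

```shell
#!/bin/sh
# Sketch: skip "d2db create" when d2db.fixedSchema is set to true.
# Assumptions (not confirmed by d2db itself): schema.conf uses
# "key = value" lines, and lines starting with "#" are comments.

# Succeeds (exit status 0) if the given file sets d2db.fixedSchema = true.
schema_is_fixed() {
    grep -Eq '^[[:space:]]*d2db\.fixedSchema[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$1"
}

# Run from the d2db directory; create the tables only if d2db owns the schema.
create_if_needed() {
    if schema_is_fixed conf/schema.conf; then
        echo "fixed schema configured, skipping create"
    else
        ./d2db create
    fi
}
```

Calling create_if_needed from the d2db directory runs ./d2db create unless the fixed-schema option is enabled.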

Getting statistics

Any time after running d2db create, you can use the stats command to query some basic statistics about the number of documents, chemical structures and hits in the d2db database. This is also a good way to check that the database has been created properly and is accessible.

For instance, running it just after create should give this output:

$ ./d2db stats
[logging information]

Documents : 0
Unique structures : 0
Hits : 0
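
For monitoring scripts it can be handy to pull a single counter out of this output. The sketch below assumes the counters are printed as "Label : number" lines exactly as in the sample above, and that any logging lines can simply be ignored. The helper name stat_value is hypothetical.

```shell
#!/bin/sh
# Sketch: read one counter out of "d2db stats" output.
# Assumes lines of the form "Label : number", as in the sample output.

# Usage: ./d2db stats | stat_value "Documents"
# Prints the number for the given label; exits with status 1 if absent.
stat_value() {
    awk -F':' -v label="$1" '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim spaces around the label
            if (name == label) {
                value = $2
                gsub(/[ \t]/, "", value)        # strip spaces around the number
                print value
                found = 1
                exit
            }
        }
        END { if (!found) exit 1 }
    '
}
```

A non-zero exit status from stat_value means the label was not found, which may indicate that the database is not reachable or that the output format differs from the one assumed here.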

Indexing

The index command tells d2db which documents to index. To index a Documentum folder, use:

./d2db index documentum:<folder>

To index a directory on a local or shared filesystem, use:

./d2db index <folder>

Note that d2db will automatically detect documents that have already been indexed in a previous run and have not been modified, in which case it will skip over them quickly. This means that the index command can be used both once for an initial indexing of a set of documents, and also later to update the index (add new documents, remove deleted documents, refresh modified documents). You can use the reindex command to force reindexing all documents even when they have not changed.

Once indexing has been done successfully, you might want to set up a cron job to run the index command regularly.
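
One possible shape for such a cron job is a small wrapper that prevents two index runs from overlapping. This is a sketch under stated assumptions: D2DB_HOME, the lock directory index.lock, the function name run_index and the paths in the sample crontab entry are all hypothetical, and the wrapper simply passes on whatever exit status d2db returns.

```shell
#!/bin/sh
# Sketch of a cron wrapper around "d2db index".
# Assumptions: D2DB_HOME points at the unzipped d2db directory, and a
# lock directory is enough to keep two runs from overlapping.
# Example crontab entry (hypothetical paths), nightly at 02:00:
#   0 2 * * * /home/d2db/index-cron.sh /data/documents >> /home/d2db/index.log 2>&1

D2DB_HOME="${D2DB_HOME:-$HOME/d2db}"

run_index() {
    target="$1"
    lock="$D2DB_HOME/index.lock"

    # mkdir is atomic: it fails if a previous run still holds the lock.
    if ! mkdir "$lock" 2>/dev/null; then
        echo "previous index run still in progress, skipping" >&2
        return 0
    fi

    # Run from the d2db directory, as the commands in this guide do.
    (cd "$D2DB_HOME" && ./d2db index "$target")
    status=$?

    rmdir "$lock"
    return "$status"
}
```

To use this as the script named in the crontab entry, end the file with a call such as run_index "$1". The target can be either a filesystem directory or a documentum:<folder> argument.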