IJC tutorial: Query building
Overview
This tutorial will explain how to execute queries in Instant JChem. Queries might be relatively simple that use one or two simple terms and search against standard fields. More often, queries will contain many terms, involve a structure based field and may contain AND/OR logic in order to filter the result set. In this tutorial we will attempt to move rapidly from simple to complex examples so you can interact with your organisations data model in an efficient and meaningful way. In doing so we will move from the view based querying to the query builder window which is designed to support complex query expression building. Managing multiple distinct queries and the associated hit lists that are generated are discussed in a separate but closely related tutorial list and query management.
You can start with the built in example demo project to try these concepts and functionality out. Later on in the tutorial we switch to the NCI data set in order exemplify the more advanced concepts involved.
Contents:
Create the demo Project.
First create a new project container. Use File -> New Project... menu entry or appropriate icon in the toolbar (shortcut - Ctrl+Shift+N). Create a new project and choose IJC Project (local database with demo data). You will notice that even though no explicit policy is applied, you are connected as the implicit user admin.
Run some simple grid based queries.
Expand the "Pubchem Demo" data tree in the Projects window in order to see the listed views. To start with, Open the "Pubchem single record form" view and the "Pubchem grid view". These views are pointing to the same data sources and so any results will be synchronised across the two views. Note that on opening of either form type you will default to Browse mode with all records available for scrolling through.
First select the grid view, then switch from Browse mode to Query mode. At any time to return this starting point you can use the Show All button to list all the records.
Once in Query mode we will see the data is replaced by a series of expression boxes, one for each field in the entity. You can enter any number of query criterion to apply within them by left click on the relevant field. If you right click in the field you will see a context specific menu or operators that are available to specify in the query. Right click in the CD_MOLWEIGHT field and select the less than operator and type in 400. If any expression is incorrectly formed, the whole expression will remain red until it is corrected (then it turns back to black). To execute the query you can hit enter or press the File -> Run Query button or use the Ctrl+Q key combination. You should now see 756 / 1000 of the rows meet this criterion. Next, you can reverse the operator. You will notice the current one is ticked. Select the greater than operator and re-run the query. You should now see 244 / 1000 of the rows meet this criterion and logically the two sets are mutually exclusive.
Each field's list of available operators (right click menu) is dependent upon the field type. The example above used a "decimal" numerical data type which has the least number of applicable operators. Two other common standard fields to query are integer and string types which contain an expanded list of operators. Right click on the IUPAC name field and select the contains operator and next to it write "chloro" - you should see 58 / 1000 hits.
You can press the Clear Query button to remove the query. You can apply criterion to more than one field at a time and the additional filters will be applied. Next, we can identify all records that have 4, 5 or 6 donors using the between operator. Right click in the donors field and select between, then type "4 and 6" and run the query - you should see 151 / 1000 hits. Additionally right click in the Rot bonds field and select <= (less than or equals) and then write "4" and run the query - you should see 75 / 100 hits.
Run some simple structure field based query in the form view.
A lot can be achieved with the grid view including structure based queries. This is achieved by double click on the structure field, followed by the use of the Marvin editor to set a query structure.
The grid view is limited in that it can query a single entity only. Also the molpanel widget is slightly more obvious in the form view, hence we shift our attention now to running structure based queries in the form view. First, click on the open form view named "Pubchem single record form" and again you will see 1000 records. Next click on Query mode and you will then see the form widgets as empty fields just like the grid view. Double click on the empty Molpanel widget and the Marvin editor will open.
We will now complete some simple Substructure searches. This type of search is most common and is thus the default. In the next section we will see how to modify the search type using the structure fields right click operator menu. First lets search for Pyridine - you should see 34 / 1000 hits. Next, lets search for Pyrrole. You can select Clear Query and then double click on the field to open Marvin again. You will notice that Pyridine (the last query) is still in the editor. Remove it using the Marvin Delete or Rubber functions. Draw the Pyrrole structure and run the query - you should see 18 / 1000 hits. It is possible to add additional standard field query criterion via either the form or grid view (as above) to build up the number of filters applied for the query - we leave you to experiment with combining terms in this way.
We will see how to run a Substructure search that consider both of these query criterion below using the query builder. Finally here, we will extend our substructure search to include an atom list in the query which will allow us to search for all records that contain the Pyrrole OR the Furan fragments and this can be defined in the Marvin editor's periodic table and Atom List button - please see the query features documentation for information on formulating advanced structural queries and associated options. You should see a total of 20 / 1000 hits for this query.
Exploring the different structure field query types available.
So far we have just run some Substructure searches and next we will expand upon this and explore the other search types available for the structure field. So far we have used the Wombat data set but it is relatively small and limited, hence we will now use a larger data set to illustrate the possibilities. You will most probably use your own data from now on (but of course you can continue to use the demo data). In order to see the query types available in the right click menu, logically the field in QUERY mode must have some structure set (for now we just use Benzene to make the point).
First lets try a Substructure search for a more complex molecule in the data set (such as the example show in the screen shot below, choose one from the dataset you are using). First draw or import the molecule using Marvin editor, press the Set Query button. Next press the Run Query button. Next try a Superstructure search using the same query. In Query mode right click on the Molpanel and change the search type to Superstructure and re-execute the query. If you scroll through the hit set you can begin to see which fragments from the complex molecule (highlighted in blue) do actually exist in the data set and with what frequency.
Next we will try the same molecule and use a Similarity search (2D fingerprint based). Right click in the Molpanel whilst in the Query mode and switch the search type to Similarity. Next press the run Run Query button.
Next, we can use the options button to filter out the best scoring hits. Right click in the Molpanel whilst in the QUERY mode and select Options. Modify the Similarity Threshold variable and set it to 0.7 and this will act as a 'floor' value filtering out all hits that score less than this value - you should now have less hits. You can also modify the coefficient here but not the Similarity search type.
Next, lets try out a duplicate search for a molecule such as the one below (choose one from the dataset you are using), you should see one hit.
Next, lets try out a Substructure search for the Sulphuric acid fragment. Right click and change the query type to Full fragment. You will obtain less hits than for the Substructure search. The difference are those where the Sulphuric acid is a Substructure of a fragment rather than a discrete or full fragment match to Sulphuric/Sulphate.
Finally, lets run a Substructure search with an additional Chemical terms query applied. Draw the Indole structure then right click and select Chem terms and then the "Lipinski rule of 5 (3 of 4)" expression. You will notice the Molpanel in QUERY mode reminds you that the Chemical term is also set.
This last example leads us into the next section. Go to Window -> Query Builder... to open the query builder window (shortcut Ctrl+B). It can be easier to use this view when you begin constructing more complex queries.
Using the Query Builder to execute complex queries.
The query builder exists to assist you in a elegant visual manner to construct your complex queries but it does also extend the query capability, since thus far: only fields present on the form view or grid view can by included in the query criteria, a field can only be present once in the query and further complex AND or OR logic has been not possible, until now. Further to this, query criteria are entered primarily by typing, so you need to know the correct syntax where as the query builder is more visual with menus, drop down boxes and mouse control. In this way building relational queries are also assisted.
Select your data entity and clear all previous queries. In the query builder window right click and select add field, Structure.
Double click on the pane and draw pyridine in Marvin, run query and note the number of hits. Now a run a query for pyrrole and note the second number.
Next we will run a query that contains both these substructures. Add another structure field and ensure each different query Pyridine and Pyrrole is correctly drawn. Notice the root logic defaults to AND. Select Run Query and note the third number. These are the records that contain both substructures. Next right click on the root node operator and change the logic from AND to OR. Select Run Query and note the fourth number. This set will contain the hits found in the AND query (third number) which means the fourth number should be higher than the third. Also, it will be higher than the first and second one, but possibly lower than their sum as one structure might contain both pyridine and pyrrole substructures (which equals the third number).
Next lets try selecting the non-hits tick with the AND variant, which should give us a reverse query, which contains nether of these substructures. You should obtain (number_of_structures_in_the_data_set) - (fourth_number) hits.
It is also possible to complete the same OR query using a single Structure field and a more sophisticated Marvin query which defines both ring sizes 5 and 6. Please see the section below on SQL where we will examine the statements generated for each of these queries.
Sometimes, setting up a more complex query might become difficult. Instant JChem contains the possibility to show the query as a text to make sure the query matches your needs. You can open the “Query As Text” panel by going to Window > Query As Text.
A new panel opens containing the text representation of the query set for the active data tree. As an example we display the query for pyridine OR pyrrole substructure.
Queries based on data from different entities
Another useful feature of the query builder is its ability to use fields from multiple entities bound by relationships for query definitions. In our case we will demonstrate this functionality using the Wombat (compound view) data tree. Let’s open the Wombat (compound view) compounds form view and keep the query builder open. We will define a query using one field from the root entity of this data tree (Wombat structures) and one field from one of the child entities (in our case, Wombat activities).
In the query builder window, right click and select add field, Structure. Now double-click into the pane, draw benzene in the Marvin sketcher and click the Set query button. Now, add the BIO.CELL field from the Wombat activities entity to the query by going to right-click menu (on the AND Wombat structures line) in the query builder and selecting Add detail field > Wombat activities > BIO.CELL .
We will select the L929 cell from the list provided to obtain the query for compounds with benzene substructure with activity data on the L929 cell line.
After pressing the Run Query button we will obtain 47 hits from the Wombat structures entity and their respective child data as shown below.
You can notice that three rows from the Wombat activities entity that are related to the hit structure are displayed and only one of them contains the value “L929 cell”. By default, Instant JChem does not filter child data results and returns all child data related to the parent record even if they do not contain the query definition value. To demonstrate the Filter Search Results function, we will repeat the same search with the filter turned on.
The filter can be turned on by clicking the icon that is found to the right of the search domain dropdown menu both in the query builder and query mode of a view.
Turning the filter on and repeating the query will provide us a restricted result set with 47 structures displaying only one row from the child entity containing the value set in the query definition.
Viewing the associated SQL statements.
When running more complex queries in Instant JChem it can be a useful exercise (particularly for more IT interested users) to examine the SQL statements that are being executed for a given query. In order to do this first you must turn SQL logging on and this can be completed by Tools -> Options and then on the Miscellaneous -> Other settings panel select from the drop down box. After this, it will be necessary to disconnect and again connect to your datababase. Then open the log file window using View -> Instant JChem log file. You can now extract the queries from the output window and if you wish, check them in SQLPLUS (Oracle only) or other RDBMS client tools.
Please note, that SQL statement for same query can be different in different versions of IJC.
Please see the closely related tutorial on list and query management which uses the same examples and explores temporary and permanent queries/lists, list management, domains and useful set operations.
Congratulations
Congratulations! You have learned the basics of querying in Instant JChem
-
Create the demo Project and share some forms.
-
Run some simple grid based queries.
-
Run some simple structure field based query in the form view.
-
Explore the different structure field query types available.
-
Use the Query Builder to execute complex queries.
-
Filter the child data
-
Viewing the associated SQL statements.