Chemistry and Biology support, KDE/Strigi GSoC project

August 22, 2007

Strigi now extracts chemical information from PNG files

Many people blogged recently about storing molecular connectivity tables in images (Egon summarized it). Strigi-chemical can now extract and index this data.

This is how it works: PngChemicalEndAnalyzer is an endAnalyzer that takes control of the stream. It detects a chemical chunk in the PNG (Molfile, CML, InChI, ...) and creates a substream, which it passes to indexChild(). The whole chain of analyzers is then executed again on that substream, and the chemical data is extracted by the corresponding stream analyzer.

It does not replace the normal PNG endAnalyzer, which is in charge of extracting all image-related information from the stream.
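To make the chunk detection step more concrete, here is a minimal Python sketch (not the Strigi C++ code) that walks the chunks of a PNG byte string and returns the payload of textual chunks that carry chemical data. The PNG layout (8-byte signature, then length/type/data/CRC chunks, with tEXt holding a keyword, a NUL byte and the text) comes from the PNG specification. The keyword set and all function names are assumptions made for illustration, since the post does not say which keywords the analyzer looks for.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Hypothetical keyword set: the post does not name the tEXt keywords
# the real analyzer recognizes, so these are illustrative only.
CHEMICAL_KEYWORDS = {"molfile", "cml", "inchi"}


def iter_chunks(data: bytes):
    """Yield (chunk_type, chunk_data) for every chunk in a PNG byte string."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG stream")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 (length) + 4 (type) + data + 4 (CRC)


def chemical_substreams(data: bytes):
    """Return (keyword, payload) pairs for tEXt chunks with a chemical keyword."""
    found = []
    for ctype, body in iter_chunks(data):
        if ctype != b"tEXt":
            continue
        keyword, _, text = body.partition(b"\x00")
        name = keyword.decode("latin-1")
        if name.lower() in CHEMICAL_KEYWORDS:
            found.append((name, text))
    return found


def make_chunk(ctype: bytes, body: bytes) -> bytes:
    """Build one PNG chunk; only used to construct the demo data below."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


# Demo data: a chunk sequence without image data, enough for chunk walking.
caffeine = b"InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
demo_png = (PNG_SIGNATURE
            + make_chunk(b"tEXt", b"InChI\x00" + caffeine)
            + make_chunk(b"IEND", b""))
```

In this sketch each returned payload plays the role of the substream: it would be handed to the next round of analyzers, which know nothing about the PNG it came from.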

By the way, the InChI analyzer was upgraded and can now detect InChIs in various text sources. It can also repair embedded spaces and, in some cases, even line breaks.

The PNG chemical analyzer has a testcase. Let's have a look at the sample files and the xmlindexer output:

Caffeine with embedded InChI (thanks Jean):

<?xml version='1.0' encoding='UTF-8'?>
<metadata>
<file uri='caffeine.png/Molecule1' mtime='1187122308'>
<value name='system.size'>66</value>
<value name='content.version'>1</value>
<value name='chemistry.inchi'>InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3</value>
<value name='chemistry.molecule_count'>1</value>
<value name='system.depth'>1</value>
<text>InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
</text>
</file>
<file uri='caffeine.png' mtime='1187122308'>
<value name='content.mime_type'>image/png</value>
<value name='system.size'>3323</value>
<value name='chemistry.molecule_count'>1</value>
<value name='content.author'>Jean Brefort</value>
<value name='image.height'>171</value>
<value name='image.width'>193</value>
<value name='image.color_depth'>32</value>
<value name='image.color_space'>RGB/Alpha</value>
<value name='compressed.compression_algorithm'>Deflate</value>
<value name='image.interlace'>None</value>
<value name='content.copyright'>Public domain</value>
<value name='system.depth'>0</value>
</file>
</metadata>

Rosiglitazone with Molfile (thanks Rich):

<?xml version='1.0' encoding='UTF-8'?>
<metadata>
<file uri='rosiglitazone.png/Molecule1' mtime='1185970696'>
<value name='system.size'>2411</value>
<value name='chemistry.name'>name</value>
<value name='chemistry.molecular_formula'>C18N3O3S1</value>
<value name='chemistry.atom_count'>25</value>
<value name='chemistry.bond_count'>27</value>
<value name='content.comment'>comments</value>
<value name='chemistry.chirality'>0</value>
<value name='system.depth'>1</value>
</file>
<file uri='rosiglitazone.png' mtime='1185970696'>
<value name='content.mime_type'>image/png</value>
<value name='system.size'>7984</value>
<value name='chemistry.molecule_count'>1</value>
<value name='image.height'>109</value>
<value name='image.width'>327</value>
<value name='image.color_depth'>32</value>
<value name='image.color_space'>RGB/Alpha</value>
<value name='compressed.compression_algorithm'>Deflate</value>
<value name='image.interlace'>None</value>
<value name='system.depth'>0</value>
</file>
</metadata>

The test files have been deposited in the Blue Obelisk CTFR.

MDL SD file support

Chemical MDL SD files are now powered by advanced KDE/Strigi technologies.

Jstreams is a lightweight C++ streams library that gives us a powerful notion of substreams. Substream providers are very fast and feed Strigi analyzers with data.

One of the main goals of my GSoC project was to stress-test Strigi, and SD files are a good example of that. They can be really large containers of MOL molecules, so it is a natural idea to access them like normal folders with MOL files inside.

I will make a tutorial-like dissection of the implementation here to encourage others to implement support of their favorite file formats in a similar way.

SdfInputStream provides MOL entries as substreams. The important thing is to make sure it does not mistake any other format for SD. The SdfInputStreamTest testcase checks some basic stream operations.
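The idea of a substream provider for SD files can be sketched in a few lines of Python. This is not the SdfInputStream code, only an illustration under the assumption that records are separated by a line containing `$$$$`, as in the MDL SD format. The function name is hypothetical, and each record is yielded whole, including any data items that follow the MOL block.

```python
from typing import Iterable, Iterator, List, Tuple

SD_DELIMITER = "$$$$"  # record separator line in MDL SD files


def iter_sd_entries(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (virtual_name, record_lines) for each record of an SD file."""
    record: List[str] = []
    index = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == SD_DELIMITER:
            index += 1
            yield f"Molecule{index}", record
            record = []
        else:
            record.append(line)
    # A last record without a trailing delimiter is still reported,
    # unless it consists of blank lines only.
    if any(line.strip() for line in record):
        index += 1
        yield f"Molecule{index}", record
```

Because this is a generator, a 500 MB file is never held in memory as a whole; only the lines of the current record are.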

ArchiveReader is the facility that enables kio_jstream to represent files as directories and to dive deep inside archives, email attachments and now SD files. ArchiveReader checks the stream header by calling the next InputStream provider and, if it matches, tries to recurse into the tree. This is a greedy approach and it hurts. There is probably some room for improvement.

A testcase with a sample file (I used a 10-compound SD file) can give you a basic idea of whether your substream provider works, but it does not test it as thoroughly as ArchiveReader does. My major troubles were with ArchiveReader. Once those are solved, you can enjoy access through the KDE interface.

I took a large 500 MB SD file with ~250,000 compounds and some smaller files of 40, 75 and 150 MB. The screenshots show the 40 MB file with ~11,000 compounds in Dolphin.

Note that not all the entry information is propagated to KIO at the moment; e.g. the size of the "directory" is missing. The interface also gradually slows down to an unusable state with larger and larger files (1, 2, 3, ... 20 million lines of text). It is probably not the best idea to put thousands of virtual files in one virtual folder. One possibility is to introduce virtual subfolders with <=100 molecules each. Naming is also a problem, because the title is optional in MOL files. I used "MoleculeN" as a substitute name for molecule #N. Another nice test is to read the sub"files" in kwrite. Below are examples of a 10-compound SD file in the file open dialog and of the Molecule2 file opened. Of course, all screenshots are from KDE4, running in a Xephyr session in my case.
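The subfolder idea could look like the following Python sketch. It is only one possible scheme, not something implemented in Strigi: the function name and the "first-last" folder naming are assumptions, while the limit of 100 entries and the "MoleculeN" names come from the text above.

```python
def virtual_path(index: int, per_folder: int = 100) -> str:
    """Map a 1-based molecule index to a 'first-last/MoleculeN' virtual path."""
    if index < 1:
        raise ValueError("molecule indices start at 1")
    # First index of the folder that contains this molecule.
    first = ((index - 1) // per_folder) * per_folder + 1
    last = first + per_folder - 1
    return f"{first}-{last}/Molecule{index}"
```

With this mapping, a file with ~11,000 compounds would show about 110 folders at the top level instead of 11,000 entries.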

Now switching to data analysis.

SdfEndAnalyzer uses SdfInputStream to explore SD files, calls indexChild() for each molecule found and stores the number of molecules in the chemistry.molecule_count field.

This is all done in Strigi, not in Strigi-chemical, because I had some trouble writing and using external substream providers. Hopefully this can be solved with the help of Jos. Since it does not add much overhead, it is not a problem.

indexChild() starts a new chain of analysis, and this is where MOL files are indexed. MdlMolFileLineAnalyzer is completely unaware of where the data stream comes from. Moreover, it does not even have direct access to the input stream; it only analyzes the text lines in sequential order. It now detects the MOL signature, makes sure the stream is not an SD file, and collects (and calculates) the chemical metadata. So far these are: chemistry.name, content.comment, chemistry.molecular_formula, chemistry.atom_count, chemistry.bond_count, chemistry.chirality.
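A rough Python sketch of this kind of line-based extraction follows. It is not the MdlMolFileLineAnalyzer code; it assumes the fixed-column V2000 molfile layout (name on line 1, comment on line 3, counts line on line 4, element symbol in columns 32-34 of each atom line). The function names are hypothetical, and writing the formula as carbon first, then hydrogen, then the rest alphabetically is my guess at a convention that matches the output shown in this post.

```python
from collections import Counter


def analyze_mol(lines):
    """Collect a few metadata fields from the lines of one V2000 MOL record.

    Returns None when the record does not look like a V2000 molfile.
    """
    if len(lines) < 4 or not lines[3].rstrip().endswith("V2000"):
        return None
    counts = lines[3]
    try:
        atom_count = int(counts[0:3])    # columns 1-3: number of atoms
        bond_count = int(counts[3:6])    # columns 4-6: number of bonds
        chirality = int(counts[12:15])   # columns 13-15: chiral flag
    except ValueError:
        return None
    atom_lines = lines[4:4 + atom_count]
    if len(atom_lines) != atom_count:
        return None  # truncated atom block
    # Count explicit atoms only; implicit hydrogens are not in the atom block.
    elements = Counter(line[31:34].strip() for line in atom_lines)
    order = sorted(elements, key=lambda s: (s != "C", s != "H", s))
    formula = "".join(f"{symbol}{elements[symbol]}" for symbol in order)
    return {
        "chemistry.name": lines[0].strip(),
        "content.comment": lines[2].strip(),
        "chemistry.molecular_formula": formula,
        "chemistry.atom_count": atom_count,
        "chemistry.bond_count": bond_count,
        "chemistry.chirality": chirality,
    }


def atom_line(x, y, z, symbol):
    """Format one fixed-width atom line; only used for the demo record."""
    return (f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}"
            " 0  0  0  0  0  0  0  0  0  0  0  0")


demo_mol = [
    "ethanol skeleton",       # line 1: molecule name
    "  sketch",               # line 2: program and timestamp
    "heavy atoms only",       # line 3: comment
    f"{3:3d}{2:3d}  0  0{0:3d}  0  0  0  0  0999 V2000",  # counts line
    atom_line(0.0, 0.0, 0.0, "C"),
    atom_line(1.5, 0.0, 0.0, "C"),
    atom_line(2.2, 1.2, 0.0, "O"),
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
]
```

Calling `analyze_mol(demo_mol)` returns a dictionary keyed by the same field names the analyzer reports. Since only explicit atoms are counted, the formula has no hydrogens unless they are drawn in the structure.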

Xmlindexer is a handy command-line tool for checking the outcome. Testing a 10-compound SD file:

xmlindexer ligs3d.sdf

<?xml version='1.0' encoding='UTF-8'?>
<metadata>
<file uri='ligs3d.sdf/Molecule1' mtime='1187611088'>
<value name='system.size'>3609</value>
<value name='chemistry.name'>MFCD02681585</value>
<value name='chemistry.molecular_formula'>C28N4O4</value>
<value name='chemistry.atom_count'>36</value>
<value name='chemistry.bond_count'>39</value>
<value name='chemistry.chirality'>1</value>
<value name='system.depth'>1</value>
</file>
<file uri='ligs3d.sdf/Molecule2' mtime='1187611088'>
<value name='system.size'>3362</value>
<value name='chemistry.name'>FCD01567969</value>
<value name='chemistry.molecular_formula'>C28N3O2</value>
<value name='chemistry.atom_count'>33</value>
<value name='chemistry.bond_count'>37</value>
<value name='chemistry.chirality'>1</value>
<value name='system.depth'>1</value>
</file>

[:skip:]

<file uri='ligs3d.sdf' mtime='1187611088'>
<value name='system.size'>34150</value>
<value name='chemistry.molecule_count'>10</value>
<value name='system.depth'>0</value>
</file>
</metadata>

Test suites using the sample files from the Blue Obelisk Chemical Test File Repository will make sure the analyzers do not get broken in the future: SDFTestCase, MOLTestCase.

August 16, 2007

Strigi-chemical GSoC final timeline

On August 20th all GSoC students and mentors are supposed to start the final evaluation. This means that the major goals of the projects should be complete and working code submitted.

The environment of my Strigi-chemical project is very dynamic: Strigi is under heavy development and undergoing XESAM reforms at the moment, and there is also much happening in the open source cheminformatics world, such as the new OpenBabel release, OSRA optical structure recognition and structural information embedded in PNG images. This post outlines the final TODO list of my project, which you can expect to be ready by the end of this SoC.

The JStreams SDF substream provider is a nice feature which represents an SDF file as a virtual folder of MOL files. I'm trying to make it work stably at the moment. It makes SDF analysis transparent, as if the file really were a set of MOL files. It also allows browsing the contents of an SDF with the jstream:// KIO;
Fix the CML2 analyzer and make sure it correctly recognizes the newly generated CMLs from Jerome's chemical-structures-2 repository;
Use the MIME helper to decide whether to continue with stream analysis or to skip the stream. This will probably make it possible to optimize some greedy analyzers;
Make sure that the OpenBabel helper works well and does not crash with parallel analyzers;
Create a chemical PNG analyzer to extract chemical information (MOL or InChI) from images. Unit tests are based on samples generated by Firefly and gchempaint;
Make sure all strigi-chemical analyzers conform to the Strigi plugin architecture and have unit tests with files taken from the Blue Obelisk CTFR (Chemical Test File Repository);
Update the chemical ontology to prepare it for migration to the XESAM ontology (fix types, cardinalities, child-parent links and indexing flags);
And finally, build a sample GUI application using molsKetch and Avogadro as KParts. This application should be able to accept a structure drawn by the user, represent it as an OpenBabel molecule, convert it to InChI using OB, make a XESAM query over D-Bus to strigidaemon, and display the list of results with a structural preview powered by Avogadro (Kalzium).

It is quite a lot of work for the 5 days left, but I have drafts of everything listed above, so it should be feasible.

August 11, 2007

Optical Structure Recognition in Strigi-chemical?

More people have blogged about the GPL optical structure recognition tool OSRA (1, 2, 3, 4) since its first release. OSRA is a young project, and though its recognition quality is poor at the moment, its open license promises bright prospects.

Journal articles, patent documents, textbooks, etc. represent chemical structures as graphics. The idea of having an OCR analyzer in Strigi-chemical to extract chemical structures from graphical files is natural, but even with OSRA there is a long way to go until a decent implementation. There are obstacles of different kinds.

OSRA deployment

At the moment, building an OSRA binary takes some effort. It has no automake/autoconf or cmake build system, and it has a long list of compile-time and runtime dependencies. ImageMagick, POTRACE, GOCR and OpenBabel are the major ones. Since this is outside the scope of my GSoC, I would have to wait for the upstream maintainer to ship OSRA as a library with an API. From my side I can make a strigi-chemical OSRA helper (helpers were described before) as an optional runtime dependency, but it would take some time to figure out an API first.

Performance

I did some benchmarks with OSRA. It takes 1'30'' to process the sample patent document:

The general overview of the OCR workflow is as follows:

it uses ImageMagick to detect the file type
PDF and PS files are rendered as images
the resolution is set; it is fixed (150 dpi) for PDF/PS files
it iterates over the pages and detects minimal boxes which most probably contain molecular structures. Here is the first box from the sample patent document:

the box is traced to obtain a vector representation
atoms, chars, fixed chars are detected
bonds are fixed, broken bonds are removed
valency-check is performed
the structure is converted to SMILES (could be an InChI though):

SMILES: OC(=O)C(C)(C)CCCCOCCCCC(C)(C)C(=O)O
InChI: InChI=1/C16H30O5/c1-15(2,13(17)18)9-5-7-11-21-12-8-6-10-16(3,4)14(19)20/h5-12H2,1-4H3,(H,17,18)(H,19,20)

continue with the next box

I tried to see how it scales and created a PDF with 64 compounds. OSRA detected 60 boxes in 1'5" with, unfortunately, quite a high error rate. This benchmark shows that OSRA cannot handle rotated (landscape) pages well, that recognizing 9 compounds can take much longer than recognizing 60, and that the quality of recognition is low.

A simple example from the PDF with 64 compounds; the second image is rendered by PubChem from the SMILES produced:

Jos suggested that we take images from PDFs as substreams. There is a substream provider for that in Strigi.

I think that, given the CPU time required to process a document, we should carefully control and select what we pass to the OSRA helper.

We could do it like this, for example:

Create a chemical image analyzer, which would first check whether structural information is already embedded in the image itself (yes, it is possible; more in my next blog post),
then it should carefully check the context to see whether it is a chemical paper (looking at the DOI, for example), and
then it can pass the extracted substream to the OSRA helper.

Strigi's-eye view on chemistry support (aKademy slides)

In his aKademy 2007 talk, Jos van den Oever explained why Strigi is more than searching. I prepared two slides on strigi-chemical for this presentation. There was definitely not enough time for Jos to give all the details during his talk, so I decided to repeat the two slides here, because I think they give quite a good picture of the components and would be nice supplements to my previous posts on Strigi analyzers.

August 10, 2007

Strigi-chemical analyzers inside

Ten things to know for a better understanding of Strigi analyzers:

the only data source is an input stream (think jstreams, not std);
stream providers take care of the embedded substreams: unzip, decode, deflate, etc.;
every analyzer gets the stream and takes a bite (the bite is 1024 bytes at the moment);
if it does not like the taste, the analyzer gives the stream back, e.g. a JPEG file does not sound nice to the MP3 analyzer, so it won't take another bite;
analyzers work in parallel; the drawback is that one analyzer cannot use the results of another;
an analyzer parses the stream and decides which data to index as fields of the ontology;
the ontology describes the (meta)data fields, all the attributes, flags and the hierarchy;
according to the ontology field description, indexwriters handle the data passed by the analyzers;
the indexing process can be controlled by analyzer configurations; a configuration can be represented as a config file;
analyzers can be distributed as plugins and loaded/included/excluded at runtime.
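The "bite" mechanism from the list can be pictured with a small Python sketch. The class and function names are hypothetical and are not the Strigi C++ interface; only the 1024-byte header size is taken from the list above. Each sniffer looks at the same header bytes and decides on its own whether it wants to continue.

```python
import io

HEADER_SIZE = 1024  # the size of the "bite" quoted above


class PngSniffer:
    """Hypothetical analyzer stub that accepts PNG streams."""
    name = "png"

    def check_header(self, header: bytes) -> bool:
        return header.startswith(b"\x89PNG\r\n\x1a\n")


class JpegSniffer:
    """Hypothetical analyzer stub that accepts JPEG streams."""
    name = "jpeg"

    def check_header(self, header: bytes) -> bool:
        return header.startswith(b"\xff\xd8\xff")


def interested_analyzers(stream, analyzers):
    """Read one header-sized bite and return the names of analyzers that accept it."""
    header = stream.read(HEADER_SIZE)
    return [a.name for a in analyzers if a.check_header(header)]
```

For example, `interested_analyzers(io.BytesIO(data), [PngSniffer(), JpegSniffer()])` returns an empty list for plain text, so neither analyzer would take another bite.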

Two problems have to be taken into account:

some greedy analyzers have a bad habit of reading every stream to the very end (e.g. cppanalyzer, yes, the one for C++ source code); this can hurt performance;
in some cases, the extracted data is not enough and additional metadata has to be calculated or generated. A very simple example would be counting the comments in a C++ source file; a more advanced example is generating an InChI identifier from the extracted chemical structure.

If a greedy analyzer performs some calculations and it is slow, this will slow down the whole indexing procedure.

The obvious solution would be to make the analyzers more selective, and if possible -- not greedy. As for the slow analyzers, they should be optional and highly selective about what they process.

Black magic of MIME

Shared-mime-info is a specification, a freedesktop.org standard for MIME description databases. Chemical-mime-data follows the same specification but adds support for chemical formats. A MIME type is meant to be a content-based identifier of the stream type, although it is sometimes mistakenly assigned by file extension only. The rules in shared-mime-info and chemical-mime-data check not only the file extension but also look at the contents of the stream. Telling the MIME type from a few specific bytes in a certain position range is considered to be black magic. A lot of formats, text formats in particular, cannot be identified easily with these magic rules.
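A magic rule of this kind, "these bytes, starting somewhere in this offset range", fits in a few lines of Python. The sketch below does not parse the real MIME databases. The rule tuple layout and function name are assumptions, and the InChI rule in particular is made up for illustration rather than copied from chemical-mime-data.

```python
from typing import NamedTuple, Optional


class MagicRule(NamedTuple):
    mime: str
    pattern: bytes
    first: int  # first offset at which the pattern may start
    last: int   # last offset at which the pattern may start


RULES = [
    MagicRule("image/png", b"\x89PNG\r\n\x1a\n", 0, 0),
    # Hypothetical rule for illustration, not taken from chemical-mime-data.
    MagicRule("chemical/x-inchi", b"InChI=", 0, 64),
]


def sniff_mime(data: bytes) -> Optional[str]:
    """Return the MIME type of the first rule whose pattern starts in its range."""
    for rule in RULES:
        window = data[rule.first:rule.last + len(rule.pattern)]
        if rule.pattern in window:
            return rule.mime
    return None
```

The weakness mentioned above is easy to see here: a text format without a fixed signature near the start of the file gives such a rule nothing to match on.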

For some file formats, the header checks in analyzers perform exactly the same procedure as the MIME checks. For this kind of analyzer it would be a good idea to rely on the mimetype. However, because you cannot use the result of one analyzer in another (see the ten points above), you cannot expect the mimetype(s) to be there when you start the analysis. Actually, there is a workaround, but it's tricky and potentially unstable.

For chemical formats, it is essential to know the exact MIME type of the stream. Chemical-mime-data can partly help here. That's why I took the code from the MIME type analyzer and made a helper out of it, just as in the external library case.

Concerning the greediness of the analyzers, inverse logic could be applied. Jos has already introduced some optional constraints in analyzerconfiguration to limit the amount of data a certain analyzer can consume to a given number of bytes. This could help. Those analyzers which do not look for a fingerprint of the file format in the header could be guided by a negative-control rule: if a non-ASCII character is encountered, stop; if a non-UTF-8 character is found, stop; i.e. if something unexpected is found, we stop processing.
Another technique, which is already employed, is to look for the necessary minimum of data and then stop the analysis. If all the fields we are looking for have already been found, why ask for more?
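The three stop conditions (byte limit, negative control, nothing left to find) can be combined in one loop. This Python sketch only illustrates that logic: the function, its parameters and the "field: value" line format are hypothetical and do not reflect the analyzerconfiguration API.

```python
def scan_lines(lines, wanted, max_bytes=4096):
    """Scan byte lines for 'prefix value' fields and stop as early as possible.

    wanted maps a field name to the line prefix that introduces it.
    Returns (fields, reason), where reason says why scanning stopped.
    """
    found = {}
    consumed = 0
    for line in lines:
        consumed += len(line)
        if consumed > max_bytes:
            return found, "byte limit reached"
        try:
            text = line.decode("ascii")
        except UnicodeDecodeError:
            # Negative control: unexpected data, discard what was collected.
            return {}, "unexpected non-ASCII data"
        for field, prefix in wanted.items():
            if field not in found and text.startswith(prefix):
                found[field] = text[len(prefix):].strip()
        if len(found) == len(wanted):
            return found, "all fields found"
    return found, "end of stream"
```

The returned reason is only there to make the behavior visible; a real analyzer would simply stop asking for more data.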

Helpers

The helpers in strigi-chemical are shared libraries with thread-safe wrappers around the MIME analyzer and libOpenBabel.

The typical workflow is:

detect and check the MIME type;
if it does not match, stop;
if the MIME type matches or could not be detected, perform further analysis;
look for recognizable data, but do not index it yet;
if something strange is found, stop and discard the stream; we do not need false positives;
if all the data has been collected, stop;
if something is missing and can be generated without using OpenBabel, generate it;
if something is missing and can be generated by OpenBabel, call the OB helper;
add the data to the index.
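The workflow above can be written down as a single function. This is a Python sketch with hypothetical names throughout: the MIME detector, the parser, the cheap generators and the OpenBabel helper are passed in as plain callables, so nothing here reflects the actual strigi-chemical interfaces.

```python
def analyze(data, expected_mime, detect_mime, parse, required,
            cheap_generators, ob_helper=None):
    """Follow the helper workflow and return the fields to index, or None."""
    mime = detect_mime(data)
    if mime is not None and mime != expected_mime:
        return None              # MIME type detected and it does not match
    fields = parse(data)         # collect recognizable data, index nothing yet
    if fields is None:
        return None              # something strange: discard the stream
    for name in required:
        if name in fields:
            continue             # already collected
        if name in cheap_generators:
            fields[name] = cheap_generators[name](fields)
        elif ob_helper is not None:
            value = ob_helper(name, fields)
            if value is not None:
                fields[name] = value
    return fields                # the caller adds these to the index
```

The expensive helper is consulted last and only for fields that are still missing, which keeps the OpenBabel call off the common path.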

Strigi plugins, tokenizers and ontology

This is a belated post on the changes in Strigi I made at the beginning of July.
The tests in Strigi-chemical have highlighted some problems in Strigi. Below I list each problem and its solution.

Problem with plugins.

Strigi analyzers are loaded as plugins with dlopen(). I noticed random and strange effects when multiple copies of one analyzer had been initialized. The explanation was simple: never use the RTLD_GLOBAL flag to load Strigi analyzers. A while ago I blogged about linking to OpenBabel and enabled this flag in Strigi. The solution was to implement a loader for libopenbabel in the strigi-chemical analyzer. LibOpenBabel is a highly recommended runtime dependency, although the chemical analyzers can work without it. A question for packagers: is it better to specify OB as a hard dependency, or as a suggested or recommended package? There is also the option of making a metapackage with the dependency. We don't want users to miss key features just because they overlooked a soft dependency in apt, for example.

Later, the OB loader was wrapped in a singleton with mutex locking. I had some random core dumps in the unit tests due to thread-safety violations. I had no idea what could cause them inside OpenBabel, so I just protected the instance.
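The pattern is easy to show in Python, even though the real wrapper is C++ around libOpenBabel. In this sketch the class name and the stand-in backend are hypothetical: one lock guards the creation of the single instance, and a second lock serializes every call into the wrapped library.

```python
import threading


class HelperSingleton:
    """One shared, lock-protected wrapper around a non-thread-safe backend."""
    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self, backend):
        self._backend = backend
        self._call_lock = threading.Lock()

    @classmethod
    def instance(cls, backend_factory):
        # The factory runs at most once, even with concurrent callers.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls(backend_factory())
            return cls._instance

    def call(self, *args):
        # Only one thread at a time may enter the backend.
        with self._call_lock:
            return self._backend(*args)


def make_backend():
    """Stand-in for a library that keeps unprotected internal state."""
    state = {"calls": 0}

    def convert(text):
        state["calls"] += 1
        return text.upper()

    convert.state = state
    return convert
```

This costs parallelism, since analyzers queue up for the helper, but it matches the "just protect the instance" decision described above.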

Problem with tokenizers.

One of the key features of the chemical analyzers is exact structure match search by chemical identifiers. The identifier recommended by IUPAC is InChI, although it is not as widespread as SMILES, for example. The power of InChI is that it represents a chemical structure with all layers of information (like charges, fixed hydrogens, isotopes and stereochemistry) as a string, and the reverse transformation is possible.
The problem here was that an InChI string was tokenized, i.e. cut into pieces. A few words about how this works. Each token represents a "word" in the dictionary of the search engine's backend. Strigi has an indexreader/indexwriter abstraction over the search backends; it can even support a hybrid backend, e.g. a mixture of clucene and sqlite. The clucene backend is well supported, while support for other backends is still rudimentary (developers welcome!). So, each field is processed by an analyzer/tokenizer during indexing, and the same tokenizer has to be used to process the field during search query analysis. Some tokens, like InChI, deserve special treatment.

Have a look at the caffeine InChI to get an impression of what I'm talking about:

InChI=1/C8H10N4O2/ c1-10-4-9-6-5(10)7(13) 12(3)8(14)11(6)2/ h4H,1-3H3

The solution was to add a special flag to the chemistry.inchi ontology field property to indicate that a special tokenizer is required. I added special index control flags that can tune the behavior of fields in the index (at the moment supported only in the clucene backend). These flags are boolean: Binary, Compressed, Indexed, Stored, Tokenized. By default, Stored|Indexed|Tokenized are enabled.
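To show why tokenization hurts exact matching, here is a Python sketch of per-field flags. The flag names and the defaults follow the paragraph above; everything else is an assumption for illustration: the field table, the function name, and the crude "split on non-alphanumerics" tokenizer, which merely stands in for whatever the clucene backend really does.

```python
import enum
import re


class IndexFlag(enum.Flag):
    BINARY = enum.auto()
    COMPRESSED = enum.auto()
    INDEXED = enum.auto()
    STORED = enum.auto()
    TOKENIZED = enum.auto()


DEFAULT_FLAGS = IndexFlag.STORED | IndexFlag.INDEXED | IndexFlag.TOKENIZED

# Hypothetical per-field overrides: the InChI field is indexed untokenized.
FIELD_FLAGS = {
    "chemistry.inchi": IndexFlag.STORED | IndexFlag.INDEXED,
}


def tokens_for(field: str, value: str):
    """Return the index terms for a field value according to its flags."""
    flags = FIELD_FLAGS.get(field, DEFAULT_FLAGS)
    if IndexFlag.TOKENIZED in flags:
        # Crude stand-in tokenizer: split on anything that is not a letter or digit.
        return [t for t in re.split(r"[^A-Za-z0-9]+", value) if t]
    return [value]
```

With the default flags, a short InChI such as `InChI=1/CH4/h1H4` falls apart into four unrelated terms, so an exact-match query would have nothing to match. With the override it stays one term.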

Ontology database.

This would not be worth a blog post if it were that simple. Strigi currently uses fieldproperties files as a draft ontology database. The registerField() API did not respect the database and took cardinality and child-parent relations as call parameters. To make my index control flags work, this behavior was changed, and the values are now loaded from the database. This left the registerField() API call with only one parameter: the field name. Loading and checking of MinCardinality and MaxCardinality from the database was implemented as well.

Why is the fieldproperties database obsolete? Well, XESAM is here and Strigi has to be 100% XESAM compatible. Jos implemented the new D-Bus XESAM interface, Flavio added a new query parser, and Phreedom is hacking on an RDF parser for the new ontology database. Add new user query support and you will get a completely XESAM-compatible Strigi (cross-)desktop search engine.