August 22, 2007

Strigi now extracts chemical information from PNG files

Many people have blogged recently about storing molecular connectivity tables in images (Egon summarized it). Strigi-chemical can now extract and index this data.

This is how it works: PngChemicalEndAnalyzer is an endAnalyzer which takes control of the stream. It detects a chemical chunk in the PNG (Molfile, CML, InChI, ...) and creates a substream which it passes to indexChild(). The whole chain of analyzers is then executed again, and the chemical data is extracted by the respective stream analyzer.
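To give an idea of what "detecting a chemical chunk" means, here is a simplified, standalone sketch of the chunk-walking part. This is not the actual PngChemicalEndAnalyzer: the real code works on a jstreams InputStream and hands the payload to indexChild() instead of printing it, and the tEXt keywords matched below are only a guess.

// Standalone sketch: walk the PNG chunks and report tEXt chunks whose
// keyword looks chemical. A PNG file is an 8-byte signature followed by
// chunks of the form length / type / data / CRC.
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

static unsigned long be32(const unsigned char* p) {
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16)
         | ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1], std::ios::binary);
    in.ignore(8);                                    // PNG signature
    unsigned char head[8];
    while (in.read(reinterpret_cast<char*>(head), 8)) {
        unsigned long length = be32(head);
        std::string type(reinterpret_cast<char*>(head) + 4, 4);
        std::vector<char> data(length);
        if (length > 0) in.read(&data[0], length);
        in.ignore(4);                                // skip the CRC
        if (type == "tEXt") {                        // layout: keyword, NUL, text
            std::string text(data.begin(), data.end());
            std::string::size_type nul = text.find('\0');
            if (nul == std::string::npos) continue;
            std::string keyword = text.substr(0, nul);
            // assumed keywords -- the ones used in the wild may differ
            if (keyword.find("molfile") != std::string::npos
                || keyword.find("InChI") != std::string::npos
                || keyword.find("chemical") != std::string::npos)
                std::cout << keyword << ":\n" << text.substr(nul + 1) << std::endl;
        }
        if (type == "IEND") break;
    }
    return 0;
}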

It does not replace the normal PNG endAnalyzer, which is in charge of extracting all image-related information from the stream.

By the way, the InChI analyzer was upgraded: it can now detect InChIs in various text sources and fix stray spaces and, in some cases, even line breaks.
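A simplified sketch of that repair idea (the real analyzer is smarter about deciding where the identifier actually ends; this version just strips all whitespace from the "InChI=" prefix onwards):

// Simplified sketch: glue an InChI back together after a text source has
// wrapped it with spaces or line breaks.
#include <cctype>
#include <iostream>
#include <string>

std::string repairInchi(const std::string& text) {
    std::string::size_type start = text.find("InChI=");
    if (start == std::string::npos) return "";
    std::string inchi;
    for (std::string::size_type i = start; i < text.size(); ++i) {
        unsigned char c = text[i];
        if (std::isspace(c)) continue;          // drop injected whitespace
        if (!std::isprint(c)) break;            // non-printable byte: stop
        inchi += static_cast<char>(c);
    }
    return inchi;
}

int main() {
    std::cout << repairInchi("InChI=1/C8H10N4O2/ c1-10-4-9-6-5(10)7(13)\n"
                             " 12(3)8(14)11(6)2/ h4H,1-3H3") << std::endl;
    return 0;
}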

The PNG chemical analyzer has a testcase; let's have a look at the sample files and the xmlindexer output:

Caffeine with embedded InChI (thanks Jean):



<?xml version='1.0' encoding='UTF-8'?>
<metadata>
<file uri='caffeine.png/Molecule1' mtime='1187122308'>
<value name='system.size'>66</value>
<value name='content.version'>1</value>
<value name='chemistry.inchi'>InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3</value>
<value name='chemistry.molecule_count'>1</value>
<value name='system.depth'>1</value>
<text>InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
</text>
</file>
<file uri='caffeine.png' mtime='1187122308'>
<value name='content.mime_type'>image/png</value>
<value name='system.size'>3323</value>
<value name='chemistry.molecule_count'>1</value>
<value name='content.author'>Jean Brefort</value>
<value name='image.height'>171</value>
<value name='image.width'>193</value>
<value name='image.color_depth'>32</value>
<value name='image.color_space'>RGB/Alpha</value>
<value name='compressed.compression_algorithm'>Deflate</value>
<value name='image.interlace'>None</value>
<value name='content.copyright'>Public domain</value>
<value name='system.depth'>0</value>
</file>
</metadata>

Rosiglitazone with Molfile (thanks Rich):



<?xml version='1.0' encoding='UTF-8'?>
<metadata>
<file uri='rosiglitazone.png/Molecule1' mtime='1185970696'>
<value name='system.size'>2411</value>
<value name='chemistry.name'>name</value>
<value name='chemistry.molecular_formula'>C18N3O3S1</value>
<value name='chemistry.atom_count'>25</value>
<value name='chemistry.bond_count'>27</value>
<value name='content.comment'>comments</value>
<value name='chemistry.chirality'>0</value>
<value name='system.depth'>1</value>
</file>
<file uri='rosiglitazone.png' mtime='1185970696'>
<value name='content.mime_type'>image/png</value>
<value name='system.size'>7984</value>
<value name='chemistry.molecule_count'>1</value>
<value name='image.height'>109</value>
<value name='image.width'>327</value>
<value name='image.color_depth'>32</value>
<value name='image.color_space'>RGB/Alpha</value>
<value name='compressed.compression_algorithm'>Deflate</value>
<value name='image.interlace'>None</value>
<value name='system.depth'>0</value>
</file>

</metadata>

The test files have been deposited in the Blue Obelisk CTFR.

MDL SD file support

Chemical MDL SD files are now powered by advanced KDE/Strigi technologies.

Jstreams is a lightweight C++ streams library which gives us the powerful notion of substreams. Substream providers are very fast and feed the Strigi analyzers with data.

One of the main goals of my GSoC project was to powertest Strigi, and SD files are a good example: they can be really large containers of MOL records, so it is natural to access them like normal folders with MOL files inside.

I will make a tutorial-like dissection of the implementation here to encourage others to implement support for their favorite file formats in a similar way.

SdfInputStream provides the MOL entries as substreams. The important thing is to make sure it does not mistake any other format for SD. The SdfInputStreamTest testcase checks some basic stream operations.
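The format makes the splitting itself simple: an SD file is a concatenation of MOL blocks, each terminated by a line starting with "$$$$". A standalone sketch of that idea (the real SdfInputStream exposes each entry as a jstreams substream instead of buffering it into a string):

// Standalone sketch: split an SD file into MOL entries at the "$$$$"
// record delimiter.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> splitSdf(std::istream& in) {
    std::vector<std::string> entries;
    std::ostringstream current;
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 4, "$$$$") == 0) {   // end of one MOL record
            entries.push_back(current.str());
            current.str("");
        } else {
            current << line << '\n';
        }
    }
    return entries;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1]);
    std::vector<std::string> mols = splitSdf(in);
    std::cout << mols.size() << " molecules" << std::endl;   // -> chemistry.molecule_count
    return 0;
}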

ArchiveReader is the facility which enables kio_jstream to represent files as directories and dive deep inside archives, email attachments and now SD files. ArchiveReader checks the stream header by calling the corresponding InputStream provider and, if it matches, tries to recurse into the tree. This is a greedy approach and it hurts; there is probably some room for improvement.

A testcase with a sample file (I used a 10-compound SD file) gives you a basic idea of whether your substream provider works or not, but it does not test it as thoroughly as ArchiveReader does. My major troubles were with ArchiveReader. Once those were solved, the KDE interface access just works.

I took a large, 500 MB SD file with ~250,000 compounds and some smaller files of 40, 75 and 150 MB. On the screenshots you can see the 40 MB file with ~11,000 compounds in Dolphin.



Note that not all the entry information is propagated to KIO at the moment, e.g. the size of the "directory" is missing. And the interface gradually slows down to an unusable state when trying larger and larger files (1, 2, 3, ... 20 million lines of text). It is probably not the best idea to put thousands of virtual files into one virtual folder; one possibility is to introduce virtual subfolders with <=100 molecules each. Naming is also a problem, because the title is optional in MOL files, so I used "MoleculeN" as a substitute name for molecule #N. Another nice test is to open the sub-"files" in KWrite. Below are examples of the 10-compound SD file in the file open dialog and of the Molecule2 file opened. Of course all screenshots are KDE4, running in a Xephyr session in my case.



Now switching to data analysis.

SdfEndAnalyzer uses SdfInputStream to explore SD files, executes one indexChild() per molecule found and stores the number of molecules in the chemistry.molecule_count field.

This is all done in Strigi, not in strigi-chemical, because I had some trouble writing and using external substream providers; this can hopefully be solved with Jos' help. Since it does not add much overhead, it is not a problem.

indexChild() starts a new chain of analysis, and this is where the MOL files are indexed. MdlMolFileLineAnalyzer is completely unaware of where the data stream comes from; moreover, it does not even have direct access to the input stream, it only analyzes the text lines in sequential order. It now detects the MOL signature, makes sure it is not an SD file, and collects (and calculates) the chemical metadata, so far: chemistry.name, content.comment, chemistry.molecular_formula, chemistry.atom_count, chemistry.bond_count, chemistry.chirality.
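For a rough idea of what the line analyzer looks at: the first MOL header line is the (optional) name, the third is a comment, and the fourth is the fixed-width counts line, whose first two three-character fields are the atom and bond counts and whose fifth field is the chiral flag. A simplified sketch of that header parsing (field positions as in the MDL CTfile specification; the real MdlMolFileLineAnalyzer receives the lines one by one through its line handler and does much more validation):

// Simplified sketch: pull the name, comment, atom/bond counts and chiral
// flag out of the first four MOL header lines (MDL CTfile fixed-width fields).
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

struct MolHeader {
    std::string name;
    std::string comment;
    int atomCount, bondCount, chiral;
};

static int fixedField(const std::string& line, std::string::size_type pos) {
    if (line.size() < pos) return 0;
    return std::atoi(line.substr(pos, 3).c_str());   // every field is 3 chars wide
}

MolHeader parseMolHeader(const std::vector<std::string>& lines) {
    MolHeader h;
    h.name    = lines.size() > 0 ? lines[0] : std::string();
    h.comment = lines.size() > 2 ? lines[2] : std::string();
    std::string counts = lines.size() > 3 ? lines[3] : std::string();
    h.atomCount = fixedField(counts, 0);    // columns  1-3
    h.bondCount = fixedField(counts, 3);    // columns  4-6
    h.chiral    = fixedField(counts, 12);   // columns 13-15, the chiral flag
    return h;
}

int main() {
    // feed the first four lines of a MOL file on stdin
    std::vector<std::string> lines;
    std::string line;
    while (lines.size() < 4 && std::getline(std::cin, line)) lines.push_back(line);
    MolHeader h = parseMolHeader(lines);
    std::cout << "chemistry.atom_count=" << h.atomCount
              << " chemistry.bond_count=" << h.bondCount
              << " chemistry.chirality="  << h.chiral << std::endl;
    return 0;
}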

Xmlindexer is a handy command line tool to check what the outcome is. Testing the 10-compound SD file:

xmlindexer ligs3d.sdf

<?xml version='1.0' encoding='UTF-8'?>
<metadata>
<file uri='ligs3d.sdf/Molecule1' mtime='1187611088'>
<value name='system.size'>3609</value>
<value name='chemistry.name'>MFCD02681585</value>
<value name='chemistry.molecular_formula'>C28N4O4</value>
<value name='chemistry.atom_count'>36</value>
<value name='chemistry.bond_count'>39</value>
<value name='chemistry.chirality'>1</value>
<value name='system.depth'>1</value>
</file>
<file uri='ligs3d.sdf/Molecule2' mtime='1187611088'>
<value name='system.size'>3362</value>
<value name='chemistry.name'>FCD01567969</value>
<value name='chemistry.molecular_formula'>C28N3O2</value>
<value name='chemistry.atom_count'>33</value>
<value name='chemistry.bond_count'>37</value>
<value name='chemistry.chirality'>1</value>
<value name='system.depth'>1</value>
</file>

[:skip:]

<file uri='ligs3d.sdf' mtime='1187611088'>
<value name='system.size'>34150</value>
<value name='chemistry.molecule_count'>10</value>
<value name='system.depth'>0</value>
</file>
</metadata>


Test suites, using the sample files from the Blue Obelisk Chemical Test File Repository, will make sure the analyzers won't be broken in the future: SDFTestCase, MOLTestCase.

August 16, 2007

Strigi-chemical GSoC final timeline

On August 20th all GSoC students and mentors are supposed to start the final evaluation. This means that the major goals of the projects should be complete and working code submitted.

The environment of my Strigi-chemical project is very dynamic: Strigi is under heavy development and undergoing XESAM reforms at the moment, and there is also a lot happening in the open source cheminformatics world, like the new OpenBabel release, OSRA optical structure recognition, and structural information embedded in PNG images. This post outlines the final TODO of my project, which you can expect to be ready by the end of this SoC.

  1. The JStreams SDF substream provider is a nice feature which represents an SDF file as a virtual folder of MOL files. I'm trying to make it work stably at the moment. This makes SDF analysis transparent, as if it really were a set of MOL files. It also allows browsing the contents of an SDF with the jstream:// KIO slave;
  2. I need to fix the CML2 analyzer and make sure it correctly recognizes the newly generated CMLs from Jerome's chemical-structures-2 repository;
  3. Use the MIME helper to decide whether to continue with stream analysis or to skip the stream. This will probably allow optimizing some greedy analyzers;
  4. Make sure that the OpenBabel helper works well and does not crash with parallel analyzers;
  5. Create a chemical PNG analyzer to extract chemical information (MOL or InChI) from images; the unit tests are based on samples generated by the Firefly and GChemPaint programs;
  6. Make sure all strigi-chemical analyzers conform to the Strigi plugin architecture and have unit tests with files taken from the Blue Obelisk CTFR (Chemical Test File Repository);
  7. Update the chemical ontology to prepare it for migration to the XESAM ontology (fix types, cardinalities, child-parent links and indexing flags);
  8. And finally, build a sample GUI application, using Molsketch and Avogadro as KParts. This application should be able to input a structure by drawing it, represent it as an OpenBabel molecule, convert it to InChI using OB, make an XESAM query over dbus to strigidaemon, and display the list of results with a structural preview powered by Avogadro (Kalzium).

It is quite a lot of work for the 5 days left, but I have drafts of everything listed above, so it should be feasible.

August 11, 2007

Optical Structure Recognition in Strigi-chemical?

More people have blogged about the GPL optical structure recognition tool OSRA (1, 2, 3, 4) since its first release. OSRA is a young project, and though its recognition quality is poor at the moment, its open license guarantees bright prospects.

Journal articles, patent documents, textbooks, etc. represent chemical structures as graphics. The idea of having an OCR analyzer in Strigi-chemical to extract chemical structures from graphics files is natural, but even with OSRA there is a long way to go until a decent implementation. There are obstacles of different kinds.

OSRA deployment

At the moment, building an OSRA binary is an effort. It has no automake/autoconf or cmake build system and a long list of compile-time and runtime dependencies; ImageMagick, POTRACE, GOCR and OpenBabel are the major ones. Since this is out of the scope of my GSoC, I would have to wait for the upstream maintainer to ship OSRA as a library with an API. From my side I can make a strigi-chemical OSRA helper (described before) with an optional runtime dependency, but it would take some time to figure out an API first.

Performance

I did some benchmarks with OSRA. It takes 1'30'' to process the sample patent document:


The general overview of the OCR workflow is as follows:
  • it uses ImageMagick to detect the file type
  • PDF and PS files are rendered as images
  • the resolution is set; it is fixed at 150 dpi for PDF/PS files
  • it iterates over the pages and detects the minimal boxes which most probably contain molecular structures. Here is the first box from the sample patent document:

  • each box is traced to obtain a vector representation
  • atoms, characters and fixed characters are detected
  • bonds are fixed and broken bonds are removed
  • a valency check is performed
  • the structure is converted to SMILES (it could just as well be an InChI):
SMILES: OC(=O)C(C)(C)CCCCOCCCCC(C)(C)C(=O)O
InChI: InChI=1/C16H30O5/c1-15(2,13(17)18)9-5-7-11-21-12-8-6-10-16(3,4)14(19)20/h5-12H2,1-4H3,(H,17,18)(H,19,20)

  • it continues with the next box
I tried to see how OSRA scales and created a PDF with 64 compounds. OSRA detected 60 boxes in 1'5", unfortunately with quite a high error rate. This benchmark shows that OSRA can't handle rotated (landscape) pages well, that recognizing 9 compounds can take much longer than recognizing 60, and that the quality of recognition is low.

A simple example from the PDF with 64 compounds; the second structure is rendered by PubChem using the produced SMILES:




Jos suggested that we take the images from the PDF as substreams. There is a substreamprovider for that in Strigi.

I think that, taking into account the CPU time required to process a document, we should carefully control and select what we pass to the OSRA helper.

We could do it like this, for example:
  • create a chemical image analyzer, which first checks whether structural information is already embedded in the image itself (yes, that is possible, more on it in my next blog post),
  • then carefully check the context, e.g. whether it is a chemical paper (looking at the DOI, for example), and
  • only then pass the extracted substream to the OSRA helper.

Strigi's-eye view on chemistry support (aKademy slides)

In his aKademy 2007 talk Jos van den Oever explained why Strigi is more than searching. I have prepared two slides on strigi-chemical for this presentation. There was definitely not enough time for Jos to give all the details during his talk, so I decided to repeat the two slides here, because I think they give quite a good picture of the components and are a nice supplement to my previous posts on Strigi analyzers.


August 10, 2007

Strigi-chemical analyzers inside

Ten things to know for a better understanding of Strigi analyzers:
  1. the only data source is an input stream (think jstreams, not std);
  2. stream providers take care of the embedded substreams: unzip, decode, deflate, etc.;
  3. every analyzer gets the stream and takes a bite (the bite is 1024 bytes at the moment);
  4. if it does not like the taste, the analyzer gives the stream back, e.g. a JPEG file does not sound nice to the MP3 analyzer, so it won't take another bite (see the sketch after this list);
  5. analyzers work in parallel; the drawback is that one analyzer can not use the results of analysis from another one;
  6. an analyzer parses the stream and decides which data to index as fields of the ontology;
  7. the ontology describes the (meta)data fields, all the attributes, flags and the hierarchy;
  8. according to the ontology field descriptions, indexwriters handle the data passed by the analyzers;
  9. the indexing process can be controlled by analyzer configurations; a configuration can be represented as a config file;
  10. analyzers can be distributed as plugins and loaded/included/excluded at runtime.
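To illustrate points 3 and 4: a format check usually only needs to peek at a few magic bytes inside the first bite and decline the stream if they do not match. A toy sketch of two such checks, written independently of the actual jstreams API:

// Toy illustration of points 3 and 4: a header check peeks at a few magic
// bytes in the first 1024-byte bite; if they do not match, the analyzer
// declines the stream and the next analyzer gets its turn.
#include <cstddef>
#include <cstring>
#include <string>

bool looksLikePng(const char* bite, std::size_t n) {
    static const unsigned char sig[8] = { 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A };
    return n >= 8 && std::memcmp(bite, sig, 8) == 0;
}

bool looksLikeMdlMol(const char* bite, std::size_t n) {
    // the "V2000"/"V3000" marker on the counts line normally sits well
    // inside the first bite
    std::string head(bite, n);
    return head.find("V2000") != std::string::npos
        || head.find("V3000") != std::string::npos;
}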
Two problems have to be taken into account:
  1. some greedy analyzers have the bad habit of reading every stream to the very end (e.g. the cppanalyzer, yes, C++ source code), which can stress performance;
  2. in some cases the extracted data is not enough, and additional metadata has to be calculated or generated. A very simple example would be counting the comments in a C++ source file; a more advanced one: generating an InChI identifier from the extracted chemical structure.
If a greedy analyzer performs some calculations and is slow, it will slow down the whole indexing procedure.

The obvious solution would be to make the analyzers more selective, and if possible -- not greedy. As for the slow analyzers, they should be optional and highly selective about what they process.

Black magic of MIME

Shared-mime-info is the freedesktop.org standard for MIME description databases. Chemical-mime-data follows the same specification but adds support for chemical formats. MIME is a content-based identifier of the stream type. Sometimes a MIME type is mistakenly assigned by file extension only; the rules in shared-mime-info and chemical-mime-data check not only the file extension but also look at the contents of the stream. Telling the MIME type from a few specific bytes at specific positions is considered black magic. A lot of formats, text formats especially, can not be identified easily with these magic rules.

For some file formats, the header checks in the analyzers perform exactly the same procedure as the MIME checks. For this kind of analyzer it would be a good idea to rely on the mimetype. But since you can not use the result of one analyzer in another (explained in the 10 points above), you can not expect the mimetype(s) to be there when you start the analysis. Actually, there is a workaround, but it's tricky and potentially unstable.

For chemical formats it is essential to know the exact MIME type of the stream. Chemical-mime-data could partly help here. That's why I took the code from the MIME type analyzer and made a helper out of it, just like in the external library case.

Concerning the greediness of the analyzers, inverse logic could be involved. Jos has already introduced some optional constraints in the analyzer configuration to limit the amount of data a certain analyzer can consume: not more than a limited number of bytes. This could help. Those analyzers which do not look for a fingerprint of the file format in the header could be guided by a negative-control rule: if a non-ASCII character is encountered -- stop; if a non-UTF-8 character is found -- stop; i.e. if something unexpected is found, we stop processing.
Another technique, which is already employed, is to look only for the necessary minimum of data and then stop the analysis. If all the fields we are looking for have already been found, why ask for more?
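A sketch of what such a negative-control check could look like (the accepted byte classes here are my own choice, not something Strigi prescribes):

// Sketch of a negative-control rule: stop as soon as a byte shows up that a
// plain-text chemical format would never contain.
#include <cstddef>

bool keepAnalyzing(const unsigned char* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        unsigned char c = data[i];
        // allow tab, CR, LF, printable ASCII and bytes >= 0x80 (UTF-8);
        // give up on any other control byte
        if (c != '\t' && c != '\r' && c != '\n' && (c < 0x20 || c == 0x7F))
            return false;
    }
    return true;   // nothing unexpected so far, take another bite
}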

Helpers

The helpers in strigi-chemical are shared libraries with thread-safe wrappers around the MIME analyzer and libOpenBabel.

The typical workflow is (a code sketch follows the list):
  1. detect and check the MIME type;
  2. if it does not match -- stop;
  3. if the MIME type matches or could not be detected -- perform further analysis;
  4. look for recognizable data, but do not index it yet;
  5. if something strange is found -- stop and discard the stream, we do not need false positives;
  6. if all the data has been collected -- stop;
  7. if something is missing and can be generated without OpenBabel -- generate it;
  8. if something is missing and can only be generated by OpenBabel -- call the OB helper;
  9. add the data to the index.
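A control-flow sketch of that workflow; the helper classes here are stand-ins with trivial bodies, not the real strigi-chemical MIME and OpenBabel helpers:

// Control-flow sketch of the workflow above. MimeHelper and OpenBabelHelper
// are stand-ins with trivial bodies, not the real strigi-chemical helpers.
#include <iostream>
#include <string>

struct MimeHelper {
    // pretend detection; the real helper reuses the shared MIME magic rules
    std::string detect(const std::string& data) const {
        return data.find("V2000") != std::string::npos ? "chemical/x-mdl-molfile" : "";
    }
};

struct OpenBabelHelper {
    bool available() const { return false; }            // OB is optional at runtime
    std::string toInchi(const std::string&) const { return ""; }
};

struct Fields {
    std::string name, inchi;
    bool complete() const { return !name.empty() && !inchi.empty(); }
};

bool analyze(const std::string& data, Fields& out) {
    std::string type = MimeHelper().detect(data);       // step 1
    if (!type.empty() && type != "chemical/x-mdl-molfile")
        return false;                                   // wrong MIME type: stop (2)
    out.name = data.substr(0, data.find('\n'));         // collect, do not index yet (4)
    if (out.name.find('\0') != std::string::npos)
        return false;                                   // something strange: discard (5)
    if (!out.complete()) {                              // something is missing (7, 8)
        OpenBabelHelper ob;
        if (ob.available()) out.inchi = ob.toInchi(data);
    }
    return true;                                        // caller adds the fields to the index (9)
}

int main() {
    Fields f;
    std::string fakeMol = "caffeine\n\n\n 24 25  0  0  0  0  0  0  0  0999 V2000\n";
    std::cout << (analyze(fakeMol, f) ? "index: " + f.name : std::string("skip")) << std::endl;
    return 0;
}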

Strigi plugins, tokenizers and ontology

This is a belated post on the changes in Strigi I made at the beginning of July.
The tests in strigi-chemical have highlighted some problems in Strigi. Below I list each problem and its solution.

Problem with plugins.

Strigi analyzers are loaded as plugins with dlopen(). I noticed random and strange effects when multiple copies of one analyzer were initialized. The explanation was simple: never use the RTLD_GLOBAL flag to load Strigi analyzers. A while ago I blogged about linking to OpenBabel and enabling this flag in Strigi. The solution was to implement a loader for libopenbabel in the strigi-chemical analyzer. LibOpenBabel is a highly recommended runtime dependency, although the chemical analyzers can work without it. A question to packagers: is it a good idea to specify OB as a dependency, or as suggested or recommended? There is also the option of making a metapackage with the dependency. We don't want the user to miss some key features just because he overlooked a soft dependency in apt, for example.

Later, the OB loader was wrapped in a singleton with mutex locking. I had some random core dumps in the unit tests due to thread safety violations. I had no idea what could cause them inside OpenBabel, so I just protected the instance.
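Roughly, the loader now looks like this (a sketch only: the soname, the pthreads locking and the class layout are my simplifications, not the actual strigi-chemical code):

// Sketch of a mutex-protected singleton that loads libopenbabel at runtime.
#include <dlfcn.h>
#include <pthread.h>
#include <cstddef>

class OpenBabelLoader {
public:
    static OpenBabelLoader& instance() {
        pthread_mutex_lock(&mutex_);
        static OpenBabelLoader loader;     // constructed once, under the lock
        pthread_mutex_unlock(&mutex_);
        return loader;
    }
    bool available() const { return handle_ != NULL; }
private:
    OpenBabelLoader() {
        // RTLD_GLOBAL on our private handle lets OpenBabel's own format
        // plugins resolve its symbols; Strigi's analyzer plugins themselves
        // are no longer loaded with that flag.
        handle_ = dlopen("libopenbabel.so.2", RTLD_LAZY | RTLD_GLOBAL);
    }
    ~OpenBabelLoader() { if (handle_) dlclose(handle_); }
    void* handle_;
    static pthread_mutex_t mutex_;
};

pthread_mutex_t OpenBabelLoader::mutex_ = PTHREAD_MUTEX_INITIALIZER;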

Problem with tokenizers.

One of the key features of the chemical analyzers is to perform an exact structure match search by chemical identifiers. The recommended IUPAC identifier is InChI, although it is not as widespread as SMILES, for example. The power of InChI is that it represents a chemical structure with all layers of information (like charges, fixed hydrogens, isotopes and stereochemistry) as a string, and the reverse transformation is possible.
The problem here was that an InChI string was tokenized, i.e. cut into pieces. A few words about how it works: each token represents a "word" in the dictionary of the search engine's backend. Strigi has an indexreader/indexwriter abstraction over the search backends; it can even support hybrid backends, e.g. a mixture of clucene and sqlite. The clucene backend is well supported, while the support for other backends is still rudimentary (developers welcome!). So each field is processed by an analyzer/tokenizer during indexing, and the same tokenizer has to be used to process the field during search query analysis. Some tokens, like InChIs, deserve special treatment.

Have a look at the caffeine InChI to get an impression of what I'm talking about:

InChI=1/C8H10N4O2/ c1-10-4-9-6-5(10)7(13) 12(3)8(14)11(6)2/ h4H,1-3H3

The solution was to add a special flag to the chemistry.inchi ontology field properties indicating that a special tokenizer is required. I added special index control flags that can tune the behavior of fields in the index (at the moment supported only in the clucene backend). These flags are boolean: Binary, Compressed, Indexed, Stored, Tokenized. By default Stored|Indexed|Tokenized are enabled.

Ontology database.

This would not be worth a blog post if it were that simple. Current Strigi uses fieldproperties files as a draft ontology database. The registerField() API did not respect the database and passed cardinality and child-parent relations as call parameters. To make my index control flags work, this behavior was changed and the values are now loaded from the database. This left the registerField() API call with only one parameter: the field name. Loading and control of MinCardinality and MaxCardinality from the database was implemented as well.

Why is the fieldproperties database obsolete? Well, XESAM is here and Strigi has to be 100% XESAM compatible. Jos implemented the new dbus XESAM interface, Flavio added a new query parser, and Phreedom is hacking on an RDF parser for the new ontology database. Add new user query support and you will get a completely XESAM-compatible Strigi (cross-)desktop search engine.

July 9, 2007

Strigi-chemical test suite

The test suite of strigi-chemical deserves special attention, because it will probably be used for all other Strigi analyzers; moreover, it could be useful for Blue Obelisk projects.

The test suite is a set of python scripts using the python unittest infrastructure. The suite provides StrigiTestCase as a base class for all test cases. It is a wrapper around the Strigi command line tools strigicmd and xmlindexer and makes sure the fixtures are properly isolated. The test runner executes all the testcases it can find in the current directory.

Each test focuses on a certain data format. It is very important to have sample data at hand before writing and testing the analyzers. Egon recently started a project to provide a central repository of chemical test files for Blue Obelisk. The problem at the moment is that the repository is incomplete. A CALL FOR DATA has been announced -- any chemical file with a free license can enter the repository. If you want your chemical data files to be recognized by Strigi, please release samples of your files under an OSI-approved license!

Subversion provides a very nice trick which allows including external repositories in the project tree. Once set up, it requires no further actions. For those developers interested: use svn propset, propget and proplist to manipulate the svn:externals property. In our case, after a checkout, ctfr will appear as a subdirectory in /test:

ctfr http://blueobelisk.svn.sourceforge.net/svnroot/blueobelisk/ctfr/trunk/
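For example, the property can be set and pulled in roughly like this (assuming the external is attached to the test directory, as mentioned above):

svn propset svn:externals "ctfr http://blueobelisk.svn.sourceforge.net/svnroot/blueobelisk/ctfr/trunk/" test
svn commit -m "add the Blue Obelisk CTFR as an svn:external" test
svn update test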

TestFileRepository is a python class for generalized access to the contents of the XML-based CTFR repository. Every testcase inherits from StrigiTestCase an initialized self.ct object, which you can ask for getTOC(), getFiles() or getFileByName() without worrying about the CTFR internals.

The strigi-chemical testcases have already helped me to detect problems with text tokenizers, float values and keyword queries. Fortunately, all of these problems have already found their solutions in the Strigi core, and now the tests won't let them reappear unnoticed.

June 30, 2007

kfile-chemical/STRIGI -> strigi-chemical

Kfile-chemical had three branches:
  • KDE3, where all chemical analyzers were KFilePlugins and provided KFileMetaInfo;
  • KDE4, where nothing had happened since it was branched;
  • STRIGI, where all the metadata extractors were Strigi StreamLineAnalyzers.
I started my project in kfile-chemical, but it will end up in a different tree. As recently proposed by Egon (my GSoC mentor) and confirmed by Jerome (the kfile-chemical maintainer) and Jos (Strigi core developer), the STRIGI branch was split off from kfile-chemical.

Now it is called strigi-chemical. The reason is that it has no KDE dependencies at the moment. Strigi-chemical also lives in playground /utils/strigi-chemical/.

The situation at the moment is:
  • kfile-chemical is now what was previously kfile-chemical/KDE3 branch
  • kfile-chemical/KDE4 branch removed
  • strigi-chemical is now what was previously kfile-chemical/STRIGI branch

June 27, 2007

Strigi website

I am not yet addicted to blogging and it is still a pain. It is more like you are preparing food and every now and then people come in and ask "what's cooking?". But there actually is a lot in common between open source development and a TV cooking show. So this time I will split my report into small posts: about the website, the test suite, the analyzers and the GUI. I hope it works better this way.

The Strigi website had been broken for some weeks. Not that broken, but people could not log in and post updates. For a dynamic project like Strigi, staying on the air means a lot, and even more so with aKademy 2007 around the corner.

I have spent too much time programming PHP in the last years (I wish it had been C++), but I did not expect this experience to be helpful in my chemical GSoC project. Well, I fixed Drupal and now the site is alive again. But there would be no story without a mystery; in this case it's the mysterious mail service at Sourceforge. Many content management systems, and Drupal is no exception, want to send mail to their users. For abuse/security reasons, Sourceforge web hosting provides no access to sendmail, nor does it allow outgoing network connections. But there is a workaround.

Sourceforge shell servers and web servers are different machines sharing the same disk partitions. Though you can not send any mail from the web servers, you can do it from your account on the Sourceforge shell server. Thus: put the outgoing mail from the web server into a queue in a mysql database, fetch it regularly with a cron script running on the shell server and feed it to sendmail. It works well, except for the cron part: there is a crontab on the shell server but, unfortunately, no crond running. Another workaround is to fetch the mail from the mysql queue with a cron script running on a remote mail server; a PHP XML-RPC call does the job.

Résumé: moving the website to another hosting provider is not a bad idea after all. What do you think?

June 19, 2007

Progress report and back on track

Preamble.
I'm happy to get back to hacking this week.
The last two weeks have been almost completely lost due to some urgent real-life issues back in my home country. I had no choice but to shift my flight and solve the problems. This was completely unplanned and left me for two weeks without a single commit, making my supervisors nervous about the outcome of my project. Now I can be on the channel, commit daily and blog twice a week, as Egon recommended and now insists.

CML2 SAX streamanalyzer in kfile-chemical.
While reading the CML specifications, I thought that there is too much flexibility in them, which makes them hard to parse. To start, I took a few CML2 samples from Jerome's Chemical Structures 2.0 project, which is a part of the Blue Obelisk data repository. These files already contain the information, it just has to be extracted. I wrote the analyzer based on streamsaxanalyzer and used the xmlindexer and strigicmd tools to see how it works. I will try to extend the analyzer to support the variety of CMLs I can find in the wild. To distribute sample test files together with kfile-chemical, I need them to be free / to have a proper license. I am not sure whether the test files from the Chemical MIME project can be included; please comment on that if you have any clue.

Test suite in kfile-chemical.
I have added a python test suite, and the first testcase of 20+ tests is for the CML analyzer. Strigi is interfaced via xmlindexer and strigicmd; I find these command line tools useful for testing, since they do not need any central storage or daemons to work. The test fixtures prepare a clean directory and the list of sample CML files, so that every test in the testcase is executed in a clean environment. For all the tests I have used the clucene backend. All tests run pretty fast, except for the valgrind test for memory leaks.

The analyzers which are not covered by tests and which are not compliant with the current Strigi ontology fieldproperties have been temporarily disabled. You can expect most of them to be fixed and enabled again later this week.

The CML testcase showed that querying by InChI (chemistry.inchi=...) gives me false positives. So the question now is whether FieldRegister::stringType is suitable for handling exact identifiers like InChI, or whether it is better to make it binary.

I was also wondering why chemistry.name (content.title is its parent) in xmlindexer turns into content (exactly that, not content.title) in strigicmd with the clucene backend.

A search by the content.version field returns no results, and querying a float molecular weight (chemistry.molecular_weight:58.1222) gives me no results either.

This leaves me with 3/20 tests failing.

InChI generator.
An InChI uniquely identifies a chemical structure. That is why it is a good idea to have InChIs for all the analyzed chemical files, where possible. OpenBabel can convert any recognised format to InChI strings. I made a working example to see whether it is easy and fast enough to generate InChIs in a Strigi streamanalyzer. It is called inchi-generator and it works for valid CML2 files only. I had to buffer the contents of the Strigi stream to pass it to the OpenBabel converter, but I feel there could be a more elegant solution, since OpenBabel works with streams as well; they are just not compatible with Strigi streams.
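The conversion itself is only a few OpenBabel calls; a sketch along these lines (OpenBabel 2.x API, assuming OB was built with InChI support, and buffering the whole document first -- which is exactly the inelegant part):

// Sketch of the CML -> InChI conversion with OpenBabel: the buffered stream
// contents go in as a string, the InChI comes back as a string.
#include <openbabel/mol.h>
#include <openbabel/obconversion.h>
#include <iostream>
#include <iterator>
#include <string>

std::string cmlToInchi(const std::string& cmlDocument) {
    OpenBabel::OBConversion conv;
    if (!conv.SetInAndOutFormats("cml", "inchi"))
        return "";                          // format plugins not available
    OpenBabel::OBMol mol;
    if (!conv.ReadString(&mol, cmlDocument))
        return "";                          // not something OB can parse
    return conv.WriteString(&mol, true);    // trim trailing whitespace
}

int main() {
    std::string cml((std::istreambuf_iterator<char>(std::cin)),
                    std::istreambuf_iterator<char>());
    std::cout << cmlToInchi(cml) << std::endl;
    return 0;
}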

Linking to OpenBabel.
I had very strange problems with unresolved symbols in OpenBabel format plugins until Geoffrey helped me. It's all about plugins! Strigi loads streamanalyzers with dlopen() on Linux, and so does libopenbabel when it needs a format plugin. The solution was simple: add RTLD_GLOBAL to the code which loads libopenbabel. Since libopenbabel is linked into inchi-generator, RTLD_GLOBAL had to be added to the Strigi loader. I wonder if this can cause problems for other analyzers. Another solution would be to load libopenbabel from inchi-generator at runtime.

OpenBabel 2.1 (SVN) Debian packages
The FindOpenBabel2.cmake script by Carsten is used in KOpenBabel, Kalzium and now in kfile-chemical. It requires --atleast-version=2.1.0, but in Debian unstable you can only find version 2.0.2. Michael Banck, the Debian maintainer, provides build rules in the debichem repository. I do not know what the reason could be for the new version not being available in sid; probably it is related to the patches that provide better version abstraction, e.g. to allow two OpenBabel versions to be installed at the same time.
Anyway, as OB 2.1 is a requirement and can be packaged, I have put x86 Debian libinchi, libopenbabel and openbabel packages here: http://neksa.net/debian/.

Strigi chemical fieldproperties.
Talking to Phreedom at the very beginning of my project, I thought that the chemical fieldproperties should represent a minimal set of metadata attributes, but in practice, given the variety of chemical formats, it is hard to define the list once and for all. That is why I have added a few more chemistry fieldproperties. Among them are the IUPAC name, the PubChem compound ID, the experimental method of structure elucidation, some physicochemical properties which are, to my mind, the most queried ones in the PDB, and a few more statistical counters. It is better to remove some unused properties later than to lose metadata which could have been extracted.

Further steps.
The test suite has to be expanded to cover all the formats currently existing in kfile-chemical. The analyzers need to be fixed to match current Strigi ontology. This could be done during this week.

OpenBabel integration requires more attention, since there is no "magic" MIME detection. I will try to employ the Chemical MIME patterns to do the detection. InChI generation is only possible if we know the source format, and Strigi is stream-based, hence we can not look at the file extension in streamanalyzers.

And of course, as eye candy, a GUI chemical search tool is one of my deliverables.

I would also love to spend some time on Strigi itself; perhaps Jos will find the kfile-chemical test suite useful for testing the built-in analyzers.

Offtopic.
From July 18th to 25th I will be attending the annual conference of the International Society for Computational Biology (ISCB) and the satellite meetings, this time in Vienna, Austria. If you happen to be there at the same time, please contact me.

19-20 : 3DSig Structural Bioinformatics and Computational Biophysics meeting.
21 : 3rd ISCB Student Council Symposium
21-25 : ISMB/ECCB conferences.

My submission has been accepted, so I will be presenting some of the results of my CUBIC project.

One piece of good news from CUBIC: thanks to the project, my final grade is now an A = First Class Honours. I hope this will improve my chances of getting a nice place for a PhD.

May 27, 2007

Introduction

Hello world!

With this post I would like to start tracking the progress of my Google Summer of Code project. The idea of the project is to integrate chemistry and biology knowledge into the KDE desktop. Think of (bio)chemical meta data extraction, indexing and search and this is where you meet Strigi.

Based on the powerful concept of Jstreams, Strigi is a high performance desktop search engine which is now an inalienable part of KDE4. Strigi can use different backends (clucene, sqlite3, ...) and builds on the simple yet very powerful idea of pluggable stream analyzers. This architecture leads to a very small number of dependencies and places an emphasis on interfaces. The interface is of your choice: link directly, use sockets, dbus or even command line utilities. The main Strigi developers, Jos van den Oever and Flavio Castelli, have already done a great job providing a stable engine, and now Strigi is moving towards integration with the Nepomuk semantic desktop project and the freedesktop.org Xesam specifications.

Nepomuk focuses on metadata ontologies and relations. Sebastian Trueg is the leader of the KDE-Nepomuk project and there is also one GSoC student, Dmitriy Soloduhin, involved in it. And thanks to Phreedom (Evgeny Egorochkin), we now have Nepomuk ontologies in Strigi.

Xesam provides unified API specifications for search and metadata services, a result of the collaboration of freedesktop.org with the Strigi, Beagle, Tracker, Pinot, Recoll and Nepomuk-KDE projects.

Now back to chemistry. Blue Obelisk establishes interaction between open projects dealing with chemical systems and cultivates standards such as InChI and CML. Blue Obelisk was born in the US at an ACS meeting, but has many of its roots in the University of Cambridge, in the group of Peter Murray-Rust, and at the University of Cologne (CUBIC). Christoph Steinbeck's group at CUBIC brought to life open projects such as CDK, Bioclipse and NMRShiftDB. I am happy that Egon Willighagen, who was a member of Steinbeck's group and is an active contributor to numerous open source projects, is now my mentor and supervisor for this GSoC project.

I was lucky to study Bioinformatics in CUBIC for the last year. I am very excited about my ion channel project which is now over, and I hope to stay with the topic during my PhD studies. By the way, if you have any open PhD positions for bioinformaticians, please let me know.

It is hard to resist the temptation to tell you some interesting facts on ion channels, but returning to the main topic I should tell you about the key projects that are very important for my GSoC project. These are BODR, Chemical MIME, OpenBabel, InChI, CML, chemical structures, Avogadro and Kalzium.

BODR stands for Blue Obelisk Data Repository; it is a shared repository for a lot of important chemoinformatics data.

Chemical MIME expands the list of standard MIME types with chemical file formats and provides example files for each format. Daniel Leidert maintains the chemical-mime-data database in Linux distributions. It conforms to David Faure's specification for MIME type databases in KDE4, and the automagical type detection relies on it. The file extension alone is not enough to uniquely identify the MIME type: ".sdf", for example, stands for the SD chemical format and at the same time for StarOffice Math documents.

OpenBabel is both a library and a command line toolbox, which allow you to manipulate chemical data in different formats. It fully supports Chemical MIME. Jerome Pansanel maintains the KOpenBabel wrapper (a Qt and KDE GUI for the OpenBabel converter) and also a large set of molecules in CML format, called Chemical Structures 2.0.1. This is also very important, because CML (Chemical Markup Language) is an XML-based chemical format which is supposed to become the standard.

Some public databases, like PubChem and the BODR Chemical Structures, implement the InChI identifier, which is an IUPAC standard. InChI allows a chemical structure to be represented in an unambiguous way, and OpenBabel can generate InChIs from chemical structures. The NCI and Kegg databases in CML, with InChIs generated, can be viewed at NCI and Kegg.

BKchem, a chemical drawing program by Beda Kosata, can regenerate structures from InChIs. Other interesting chemical drawing programs, which at the moment can not import InChIs, are GChemPaint and Molsketch. BKchem uses Tk widgets, and GChemPaint is part of the GNOME desktop. Molsketch by Harm van Eersel is a molecular drawing tool for KDE. If supplied as a KPart, Molsketch can find a bright future in different KDE4 applications.

Kalzium is a part of kdeedu; it started as the periodic table of the elements program by Carsten Niehaus and is now gaining momentum and attracting more hackers who want better chemistry support in KDE. Kalzium/Avogadro is a 3D molecular visualization library maintained by Benoît Jacob. It uses Eigen, a lightweight linear algebra C++ template library which is already a part of KDE4. Kalzium/Avogadro has acquired another GSoC student -- Marcus Hanwell. The leader of another interesting chemical KDE project, KryoMol, Armando Navarro Vázquez, recently sent a patch to separate the Kalzium Molecular Viewer as a KPart.

Kfile-chemical is a project started by Egon and later supported by Jerome and Daniel. Initially it was a set of kfile plugins that allowed chemical metadata extraction, but with the initiative to port kfile plugins to Strigi, kfile-chemical now provides Strigi with chemistry-aware stream analyzers. It is hosted in KDE SVN playground, and since it aims at a low number of dependencies it has the potential to become a part of kdeedu, for example.

Since kfile-chemical is where I make my first efforts, I'll briefly describe what I am doing now and what you can expect by the end of the project.
  1. Make all kfile-chemical analyzers compatible with the Strigi/KDE/Nepomuk chemical ontology. This means that chemical field properties are defined in Strigi, and during the metadata extraction process the stream analyzers are supposed to fill in the relevant fields. The chemical field properties at the moment are: chemistry.inchi, chemistry.molecular_formula, chemistry.molecular_weight, chemistry.pdbid, chemistry.xray_resolution. Other values are supposed to be stored in generic field properties, such as content.title and container.items;
  2. Generate an InChI (chemistry.inchi) for structures which do not have one already, using the OpenBabel library;
  3. Provide a test suite for the analyzers to make sure nothing breaks when one of the libraries in this mixture is updated;
  4. Expand the list of supported chemical file types to cover as many of the Chemical MIME types as possible.
If it goes smoothly, I will try to integrate OSCAR3 to process plain text and create InChIs for the molecules found in that text. This will allow indexing and semantic linking of the literature and the chemical files.

At our first meeting, Egon suggested that I provide a KDE4 GUI chemical search tool, which could possibly be expanded for more generic purposes, like querying abstract field properties from the KDE-Nepomuk ontology. It is also a great way to test all the technologies and libraries involved. I won't bloat this post with mockups and screenshots, because it is already quite long, but I will certainly come back to them later this week. So, the idea is to have the following workflow implemented:
  1. While indexing, the InChI string is extracted, or generated with the help of libOpenBabel, by one of the kfile-chemical analyzers;
  2. The InChI string is stored in the chemistry.inchi field property in the Strigi storage;
  3. It can be queried directly by issuing a "chemistry.inchi:" query in strigiclient;
  4. The GUI tool can use the Molsketch KPart to input the structure with the mouse. The structure is then converted to an InChI using OpenBabel and used as a search key;
  5. The name of the compound, or a synonym, can be specified as a search key;
  6. The search query is sent to Strigi via dbus and the search results are received in response;
  7. The search results are either text documents of some sort, or chemical structures. To visualize the chemical structures, the Kalzium/Avogadro KPart can be used.
BTW, is it possible to do a substructure search using InChI, not to mention ignoring some InChI layers?

This project is a good powertest for all Strigi technologies, but I also hope to be useful to Strigi by extending some functionality and writing unit testcases. And I am sure Jos won't let me go just like that :-) also because my primary affiliation here is KDE/Strigi.

I am happy to have this opportunity to work side by side with very skilled open source developers, to teach myself good style and, of course, to have this project right at the intersection of my interests: open source, Linux, KDE and Bio(Chemo)Informatics.

The initial project proposal can be found here.

Now I will tell you a few words about my current progress. While trying to get all the tools and libraries listed above working on my machine, I was surprised by a crash in KDE/Avogadro which was caused by a bug in my radeon Mesa DRI drivers. Fortunately I was able to trace the problem and fix this annoying crash in the Avogadro OpenGL initialization. Then I switched to kfile-chemical and started by adapting the CML stream analyzer to the current standard and ontology. To get my testcases done, I had to add passing filters to the strigicmd command line tool. While working on the tests involving all the CML structures from Chemical Structures 2.0.1, I realized that, because the current CML metadata extraction is not XML aware, I have to rewrite it using StreamSaxAnalyzer to make it work as desired. This is what I am doing at the moment.

My next steps will be to generate InChIs for CML files lacking the identifier, then to improve all the other available chemical analyzers and create tests for them, and then to run some productivity tests involving a mirror of the PDB database. And, of course, I will start implementing the GUI chemical query tool.