June 19, 2007

Progress report and back on track

I'm happy to get back to hacking this week.
The last two weeks were almost completely lost to some urgent real-life issues back in my home country. I had no choice but to reschedule my flight and sort the problems out. This was completely unplanned and left me without a single commit for two weeks, which made my supervisors nervous about the outcome of my project. Now I can be on the channel, commit daily and blog twice a week, as Egon recommended and now insists.

CML2 SAX streamanalyzer in kfile-chemical.
While reading the CML specifications, I thought that the format allows too much flexibility, which makes it hard to parse. To start, I took a few CML2 samples from Jerom's Chemical Structure 2.0 project, which is part of the BlueObelisk data repository. These files already contain the information; it just had to be extracted. I wrote the analyzer based on streamsaxanalyzer and used the xmlindexer and strigicmd tools to see how it works. I will try to extend the analyzer to support the variety of CML files I can find in the wild. To distribute sample test files together with kfile-chemical, I need them to be free, i.e. to have a proper license. I am not sure whether the test files from the Chemical MIME project can be included. Please comment on that if you have any clue.
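
To give an idea of what SAX-style extraction from CML looks like, here is a minimal Python sketch. It is not the kfile-chemical analyzer (that one is C++ and built on streamsaxanalyzer); the handler class, the field names and the sample document are all made up for illustration, and it only covers a handful of CML2 elements.

```python
import xml.sax

# Tiny hand-written CML2-like sample (hypothetical, for illustration only).
SAMPLE = b"""<?xml version="1.0"?>
<molecule id="ethanol" xmlns="http://www.xml-cml.org/schema">
  <name>ethanol</name>
  <formula concise="C 2 H 6 O 1"/>
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
</molecule>"""


class CmlHandler(xml.sax.ContentHandler):
    """Collects a few metadata fields while the document streams by."""

    def __init__(self):
        super().__init__()
        self.fields = {"atom_count": 0}
        self._in_name = False
        self._name_parts = []

    def startElement(self, name, attrs):
        # Drop a prefix such as "cml:" (a simplification, not real namespace handling).
        local = name.split(":")[-1]
        if local == "atom":
            self.fields["atom_count"] += 1
        elif local == "molecule" and "id" in attrs:
            self.fields.setdefault("molecule_id", attrs["id"])
        elif local == "formula" and "concise" in attrs:
            self.fields.setdefault("formula", attrs["concise"])
        elif local == "name":
            self._in_name = True
            self._name_parts = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join later.
        if self._in_name:
            self._name_parts.append(content)

    def endElement(self, name):
        if name.split(":")[-1] == "name" and self._in_name:
            self._in_name = False
            text = "".join(self._name_parts).strip()
            if text:
                self.fields.setdefault("name", text)


def analyze_cml(data):
    """Parse CML bytes and return the collected fields as a dict."""
    handler = CmlHandler()
    xml.sax.parseString(data, handler)
    return handler.fields


print(analyze_cml(SAMPLE))
```

The flexibility problem shows up quickly here: the same information can sit in attributes, in child elements or under different conventions, so each variant needs its own branch in the handler.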

Test suite in kfile-chemical.
I have added a Python test suite, and the first testcase, with 20+ tests, is for the CML analyzer. Strigi is interfaced via xmlindexer and strigicmd. I find these command line tools useful for testing, since they do not need any central storage or daemons to work. The test fixtures prepare a clean directory and the list of sample CML files, so that every test in the testcase is executed in a clean environment. For all the tests I have used the clucene backend. All tests run pretty fast, except for the valgrind test for memory leaks.
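
The fixture idea can be sketched roughly like this. This is a simplified illustration, not the code from the repository: the class names and the `samples` directory are assumptions, and the actual calls to xmlindexer and strigicmd are left out on purpose.

```python
import os
import shutil
import tempfile
import unittest


class CleanDirTestCase(unittest.TestCase):
    """Gives every test its own empty index directory and fresh sample copies."""

    # Hypothetical location of the sample CML files.
    SAMPLE_DIR = "samples"

    def setUp(self):
        # A new temporary tree per test, so no state leaks between tests.
        self.workdir = tempfile.mkdtemp(prefix="kfile-chemical-test-")
        self.index_dir = os.path.join(self.workdir, "index")
        self.data_dir = os.path.join(self.workdir, "data")
        os.mkdir(self.index_dir)
        os.mkdir(self.data_dir)
        self.samples = []
        if os.path.isdir(self.SAMPLE_DIR):
            for fname in sorted(os.listdir(self.SAMPLE_DIR)):
                if fname.endswith(".cml"):
                    dest = os.path.join(self.data_dir, fname)
                    shutil.copy(os.path.join(self.SAMPLE_DIR, fname), dest)
                    self.samples.append(dest)

    def tearDown(self):
        shutil.rmtree(self.workdir, ignore_errors=True)


class FixtureSelfTest(CleanDirTestCase):
    def test_index_directory_starts_empty(self):
        self.assertEqual(os.listdir(self.index_dir), [])

    def test_samples_live_in_data_directory(self):
        for path in self.samples:
            self.assertEqual(os.path.dirname(path), self.data_dir)
```

A real test would run the indexer on `self.data_dir`, writing to `self.index_dir`, and then query the result; since both directories are thrown away in `tearDown`, no central storage or daemon is involved.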

The analyzers which are not covered by tests and which are not compliant with the current Strigi ontology fieldproperties have been temporarily disabled. You can expect most of them to be fixed and enabled again later this week.

The CML testcase showed that querying by InChI (chemical.inchi=...) gives me false positives. So the question now is whether FieldRegister::stringType is suitable for exact identifiers like InChI, or whether it is better to make the field binary.
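
I have not confirmed the cause yet, but one plausible explanation is that a string field gets tokenized, so that two InChIs sharing a fragment match each other. The toy Python sketch below shows the effect. The tokenizer and the OR-style matching are assumptions for illustration and are not what Strigi or CLucene actually do; the two InChI strings are illustrative examples of isomers with the same formula layer.

```python
import re

# Two different compounds with the same formula (illustrative InChI strings).
ETHANOL = "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"
DIMETHYL_ETHER = "InChI=1/C2H6O/c1-3-2/h1-2H3"


def tokenize(value):
    """Toy analyzer: split on anything that is not a letter or a digit."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", value) if t]


def tokenized_match(query, stored):
    """Match if any query token occurs in the stored value (OR semantics)."""
    return bool(set(tokenize(query)) & set(tokenize(stored)))


def exact_match(query, stored):
    """Treat the identifier as one opaque, untokenized term."""
    return query == stored


print(tokenize(ETHANOL))
print(tokenized_match(ETHANOL, DIMETHYL_ETHER))  # shared tokens such as "c2h6o"
print(exact_match(ETHANOL, DIMETHYL_ETHER))
```

If this is what happens, any field type that stores the InChI as a single untokenized term would avoid the false positives.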

I was also wondering why chemistry.name (content.title is its parent) in xmlindexer is turned into content (exactly content, not content.title) in strigicmd with the clucene backend.

A search by the content.version field returns no results, and querying a float molecular weight (chemistry.molecular_weight:58.1222) gives me no results either.
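
I do not know yet why the float query fails. One thing worth checking is whether the value is compared as text, in which case only a character-for-character match would hit. The Python sketch below contrasts that with a numeric range comparison; the function names and the small weight table are made up for illustration.

```python
# Hypothetical indexed values: compound name -> molecular weight.
WEIGHTS = {"butane": 58.12, "acetone": 58.08, "ethanol": 46.07}


def exact_query(weights, target_text):
    """Text comparison, as an index might do for a float stored as a string."""
    return [name for name, w in weights.items() if str(w) == target_text]


def range_query(weights, target, tolerance=0.01):
    """Numeric comparison inside [target - tolerance, target + tolerance]."""
    return [name for name, w in weights.items() if abs(w - target) <= tolerance]


print(exact_query(WEIGHTS, "58.1222"))
print(range_query(WEIGHTS, 58.1222))
```

With the stored value rounded differently from the query, the text comparison finds nothing while the range comparison still finds the compound.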

This leaves me with 3/20 tests failing.

InChI generator.
An InChI uniquely identifies a chemical structure. That is why it is a good idea to have InChIs for all the analyzed chemical files, where possible. OpenBabel can convert any recognised format to InChI strings. I made a working example to see if it is easy and fast enough to generate InChIs in a Strigi streamanalyzer. It is called inchi-generator and it works for valid CML2 files only. I had to buffer the contents of the Strigi stream to pass it to the OpenBabel converter, but I feel there could be a more elegant solution, since OpenBabel works with streams as well; they are just not compatible with Strigi streams.
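
The buffering approach looks roughly like this when sketched in Python. The real inchi-generator is C++ and talks to Strigi streams, so this is only an analogue: `buffer_stream` and `generate_inchi` are hypothetical names, and the converter assumes the OpenBabel Python bindings are installed (it is only imported when actually called).

```python
import io


def buffer_stream(stream, chunk_size=4096):
    """Read a forward-only stream to the end and return all of its bytes."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def openbabel_cml_to_inchi(data):
    """Convert CML bytes to an InChI string via the OpenBabel bindings."""
    try:
        from openbabel import openbabel  # layout of the 3.x bindings
    except ImportError:
        import openbabel  # layout of the 2.x bindings
    conversion = openbabel.OBConversion()
    if not conversion.SetInAndOutFormats("cml", "inchi"):
        raise RuntimeError("CML or InChI format plugin is not available")
    mol = openbabel.OBMol()
    if not conversion.ReadString(mol, data.decode("utf-8")):
        raise ValueError("OpenBabel could not parse the input as CML")
    return conversion.WriteString(mol).strip()


def generate_inchi(stream, convert=openbabel_cml_to_inchi):
    """Buffer the whole stream, then hand the buffer to a converter."""
    return convert(buffer_stream(stream))


# Usage with a stand-in converter, so the sketch runs without OpenBabel.
fake = io.BytesIO(b"<molecule/>")
print(generate_inchi(fake, convert=lambda data: "%d bytes buffered" % len(data)))
```

The cost of this design is that the whole file sits in memory before conversion starts, which is exactly what a stream adapter between the two libraries would avoid.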

Linking to OpenBabel.
I had very strange problems with unresolved symbols in OpenBabel format plugins until Geoffrey helped me. It's all about plugins! Strigi loads streamanalyzers with dlopen() on Linux, and so does libopenbabel when it needs a format plugin. The solution was simple: add RTLD_GLOBAL to the code which loads libopenbabel. Since libopenbabel is linked to inchi-generator, RTLD_GLOBAL had to be added to the Strigi loader. I wonder if it can cause problems for other analyzers. Another solution would be to load libopenbabel from inchi-generator at runtime.
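
For readers who have not met RTLD_GLOBAL before: it makes the symbols of a loaded library visible to libraries loaded afterwards, which is what the format plugins need. The actual fix is in C++, but the same dlopen() flag can be shown from Python through ctypes. The helper name below is made up, and this is only an analogue of the second option (loading the library explicitly at runtime).

```python
import ctypes


def load_global(path):
    """dlopen() a shared library with RTLD_GLOBAL, so that libraries loaded
    later (for example format plugins) can resolve symbols against it."""
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
```

A call such as `load_global("libopenbabel.so")` would then be made before any plugin is needed; the exact file name is an assumption and depends on the installed version.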

OpenBabel 2.1 (SVN) Debian packages.
The FindOpenBabel2.cmake script by Carsten is used in KOpenBabel, Kalzium and now in kfile-chemical. It requires --atleast-version=2.1.0. In Debian unstable you can only find version 2.0.2. Michael Banck, the Debian maintainer, provides build rules in the debichem repository. I do not know why the new version is not available in sid. Probably it is related to the patches that provide better version abstraction, e.g. to allow two OpenBabel versions to be installed at the same time.
Anyway, as OB 2.1 is a requirement and can be packaged, I have put the x86 Debian libinchi, libopenbabel and openbabel packages here http://neksa.net/debian/.

Strigi chemical fieldproperties.
Talking to Phreedom at the very beginning of my project, I thought that the chemical fieldproperties should represent a minimal set of metadata attributes, but in practice, taking into account the variety of chemical formats, it is hard to define the list once and for all. That is why I have added a few other chemistry.* fieldproperties. Among these are IUPAC Name, PubChem Compound ID, the experimental method of structure elucidation, some physicochemical properties which to my mind are the most queried in PDB, and a few more statistical counters. It is better to remove some unused properties later than to refrain from storing metadata which could be extracted.

Further steps.
The test suite has to be expanded to cover all the formats currently existing in kfile-chemical. The analyzers need to be fixed to match current Strigi ontology. This could be done during this week.

OpenBabel integration requires more attention, since there is no "magic" MIME detection. I will try to employ Chemical MIME patterns to do the detection. InChI generation is only possible if we know the source format. Strigi is stream-based, hence we cannot look at the file extension in streamanalyzers.
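
Without a file extension, the format has to be guessed from the first bytes of the stream. The Python sketch below shows the idea with three hand-written heuristics. These are not the Chemical MIME patterns, the function name and the returned labels are made up, and real detection would need many more rules.

```python
def sniff_chemical_format(header):
    """Guess a format label from the first bytes of a stream, or return None."""
    text = header.decode("utf-8", errors="replace")
    lines = text.splitlines()
    # MDL molfile: the counts line (4th line) ends with a version tag.
    if len(lines) >= 4 and lines[3].rstrip().endswith(("V2000", "V3000")):
        return "mdl-molfile"
    # CML: XML that mentions the CML namespace or a <cml>/<molecule> element.
    if text.lstrip().startswith("<") and (
        "xml-cml.org" in text or "<cml" in text or "<molecule" in text
    ):
        return "cml"
    # PDB: fixed-width records that start with well-known keywords.
    if lines and lines[0].startswith(("HEADER", "ATOM  ", "HETATM", "REMARK")):
        return "pdb"
    return None


print(sniff_chemical_format(b'<?xml version="1.0"?>\n<molecule id="m1"/>'))
```

Once a label is known, it can be mapped to the input format name that the converter expects, so that InChI generation is attempted only for recognised formats.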

And of course, as eye candy, a GUI chemical search tool is one of my deliverables.

I would also love to spend some time on Strigi, perhaps Jos will find the kfile-chemical testsuite good for testing the built-in analyzers.

July 18th -- 25th I will be attending the annual conference of the International Society for Computational Biology (ISCB) and the satellite meetings, this time in Vienna, Austria. If you happen to be there at the same time, please contact me.

19-20 : 3DSig Structural Bioinformatics and Computational Biophysics meeting.
21 : 3rd ISCB Student Council Symposium
21-25 : ISMB/ECCB conferences.

My submission has been accepted, so I will be presenting some of the results of my CUBIC project.

One piece of good news from CUBIC. Thanks to the project, my final grade is now A = First Class Honours. I hope this will improve my chances of getting a nice place for the PhD.

1 comment:

Egon Willighagen said...

Hi Alexandr,

here are some first thoughts... I will reply to them tomorrow in more detail.
Thanks for this elaborate post and for the commits. I am happy to see that you have made good progress!

I might bring it up on the CML discussion list [1], and ask if people can run it against their CML data sets.

Test Suite
Please check out the archives of the Blue Obelisk mailing list; we are trying to set up a test suite there, and I think we should try to integrate this.

Can you elaborate a bit on the false positives for the InChI searching? What query did you use, and what false results show up? Does the indexer split the InChI up into fragments?

About the weight search: have you tried to give a search margin, e.g. >52.1 AND <52.2 ?

InChI generation
I am very happy (and impressed) to see that you got this running already. How do you approach the diversity of file formats? For example, for an MDL sd file the InChI should go in as 'this doc contains info on this compound', while for a simple XYZ file it is more like 'this is the document with identifier'? Or do you think it should always be the first?

New fieldproperties
OK, that's fine.

Next steps
Why not use chemical-mime (also in Debian) to do this [2]?