June 30, 2007

kfile-chemical/STRIGI -> strigi-chemical

Kfile-chemical had three branches:
  • KDE3, where all chemical analyzers were KFilePlugins and provided KFileMetaInfo;
  • KDE4, where nothing had happened since it was branched;
  • STRIGI, where all the metadata extractors were Strigi StreamLineAnalyzers.
I started my project in kfile-chemical, but it will end up in a different tree. As recently proposed by Egon (my GSoC mentor) and confirmed by Jerome (kfile-chemical maintainer) and Jos (Strigi core developer), the STRIGI branch has been separated from kfile-chemical.

It is now called strigi-chemical, because it has no KDE dependencies at the moment. Strigi-chemical also lives in playground, under /utils/strigi-chemical/.

The situation at the moment is:
  • kfile-chemical is now what was previously the kfile-chemical/KDE3 branch
  • the kfile-chemical/KDE4 branch has been removed
  • strigi-chemical is now what was previously the kfile-chemical/STRIGI branch

June 27, 2007

Strigi website

I am not yet addicted to blogging and it is still a pain. It feels like preparing food while every now and then people walk in and ask "what's cooking?". But there actually is much in common between Open Source development and a TV cooking show. So this time I will split my reports into small posts: about the website, the test suite, the analyzers and the GUI. I hope it works better this way.

The Strigi website has been broken for some weeks. Not completely broken, but people could not log in and post updates. For dynamic projects like Strigi, staying on air means a lot, and even more so with aKademy 2007 around the corner.

I have spent too much time programming PHP in recent years (I wish it had been C++), but I did not expect this experience to be helpful in my chemical GSoC project. Well, I fixed Drupal and now the site is alive again. But there would be no story without a mystery. In this case it is the mysterious mail service at Sourceforge. Many content management systems, and Drupal is no exception, want to send mail to their users. For abuse/security reasons Sourceforge web hosting does not provide access to sendmail, nor does it allow outgoing network connections. But there is a workaround.

Sourceforge shell servers and web servers are different machines sharing the same disk partitions. Though you cannot send any mail from the web servers, you can do it from your account on your Sourceforge shell server. Thus, put the outgoing mail from the web server into a queue in a MySQL database, fetch it regularly with a cron script running on your shell server and feed it to sendmail. It works well, except for the cron part: there is a crontab on the shell server but, unfortunately, no crond running. Another workaround is to fetch mail from the MySQL queue with a cron script running on a remote mail server. A PHP XML-RPC call does the job.
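The queue-and-drain idea can be sketched roughly as follows. This is only an illustration in Python, not the code running on the site (which is PHP with MySQL): sqlite3 stands in for the database, and the table name mail_queue, its columns and the deliver callback are all made up for the example.

```python
import sqlite3


def create_queue(conn):
    # Hypothetical schema: one row per outgoing message, flagged once sent.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail_queue ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "recipient TEXT NOT NULL, "
        "subject TEXT NOT NULL, "
        "body TEXT NOT NULL, "
        "sent INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def enqueue(conn, recipient, subject, body):
    # Web-server side: instead of calling sendmail, just store the message.
    conn.execute(
        "INSERT INTO mail_queue (recipient, subject, body) VALUES (?, ?, ?)",
        (recipient, subject, body),
    )
    conn.commit()


def drain(conn, deliver):
    # Cron side: hand every unsent message to `deliver`, then mark it sent.
    rows = conn.execute(
        "SELECT id, recipient, subject, body FROM mail_queue "
        "WHERE sent = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for mail_id, recipient, subject, body in rows:
        deliver(recipient, subject, body)
        conn.execute("UPDATE mail_queue SET sent = 1 WHERE id = ?", (mail_id,))
        conn.commit()
        delivered += 1
    return delivered
```

On a machine that is allowed to send mail, deliver would be whatever pipes the message to sendmail; the web server only ever calls enqueue. Marking each row right after delivery means a crash in the middle of a run resends at most one message.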

Résumé: moving the website to another host is not a bad idea after all. What do you think?

June 19, 2007

Progress report and back on track

I'm happy to get back to hacking this week.
The last two weeks were almost completely lost due to some urgent real-life issues back in my home country. I had no choice but to reschedule my flight and go and solve the problems. This was completely unplanned and left me without a single commit for two weeks, which made my supervisors nervous about the outcome of my project. Now I can be on the channel, commit daily and blog twice a week, as Egon recommended and as he now insists.

CML2 SAX streamanalyzer in kfile-chemical.
While reading the CML specifications, I thought that there is too much flexibility in the format, which makes it hard to parse. To start, I took a few CML2 samples from Jerome's Chemical Structure 2.0 project, which is a part of the BlueObelisk data repository. These files already contain the information; it just had to be extracted. I wrote the analyzer based on streamsaxanalyzer. I used the xmlindexer and strigicmd tools to see how the analyzer works. I will try to extend the analyzer to support the variety of CML flavours I can find in the wild. To distribute sample test files together with kfile-chemical I need them to be free, i.e. to have a proper license. I am not sure whether the test files from the Chemical MIME project can be included. Please comment on that if you have any clue.

Test suite in kfile-chemical.
I have added a Python test suite, and the first testcase, with 20+ tests, is for the CML analyzer. Strigi is interfaced via xmlindexer and strigicmd. I find these command line tools useful for testing, since they do not need any central storage or daemons to work. The test fixtures prepare a clean directory and the list of sample CML files, so that every test in the testcase is executed in a clean environment. For all the tests I have used the clucene backend. All tests run pretty fast, except for the valgrind test for memory leaks.
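For the curious, a fixture of this kind can be sketched with the standard unittest module. This is a simplified illustration rather than the actual code in the repository: the class names, the samples_dir attribute and the strigi-test- prefix are invented for the example, and the calls to xmlindexer and strigicmd are left out.

```python
import shutil
import tempfile
import unittest
from pathlib import Path


class CleanDirTestCase(unittest.TestCase):
    # Hypothetical: directory holding the sample *.cml files (None = no samples).
    samples_dir = None

    def setUp(self):
        # Every test gets its own fresh working directory...
        self.workdir = Path(tempfile.mkdtemp(prefix="strigi-test-"))
        # ...which is removed again even if the test fails.
        self.addCleanup(shutil.rmtree, self.workdir, ignore_errors=True)
        # Copy the samples in, so tests never touch the originals.
        self.samples = []
        if self.samples_dir is not None:
            for src in sorted(Path(self.samples_dir).glob("*.cml")):
                dst = self.workdir / src.name
                shutil.copy(src, dst)
                self.samples.append(dst)


class ExampleTest(CleanDirTestCase):
    def test_workdir_starts_empty(self):
        # Without samples the directory must contain nothing at all.
        self.assertEqual(list(self.workdir.iterdir()), [])
```

Because setUp runs before each test method, an index written by one test can never leak into the next one.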

The analyzers which are not covered by tests and which are not compliant with the current Strigi ontology fieldproperties have been temporarily disabled. You can expect most of them to be fixed and re-enabled later this week.

The CML testcase showed that querying by InChI (chemical.inchi=...) gives me false positives. So the question now is whether FieldRegister::stringType is suitable for exact identifiers like InChI, or whether it would be better to make the field binary.
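I have not verified what the backend actually does with the field, so the following is only a toy illustration of how such false positives could arise if a string field were split into tokens on punctuation. The tokenizer and both matching functions are invented for the example; the two strings are InChI-style identifiers for butane and isobutane.

```python
import re


def tokens(value):
    # Assumed (unverified) behaviour: split on anything non-alphanumeric.
    return {t.lower() for t in re.split(r"[^A-Za-z0-9]+", value) if t}


def tokenized_match(query, stored):
    # A hit as soon as the query and the stored value share any token.
    return bool(tokens(query) & tokens(stored))


def exact_match(query, stored):
    # What one wants for an identifier: the whole string or nothing.
    return query == stored


BUTANE = "InChI=1/C4H10/c1-3-4-2/h3-4H2,1-2H3"
ISOBUTANE = "InChI=1/C4H10/c1-4(2)3/h4H,1-3H3"
```

Here tokenized_match(BUTANE, ISOBUTANE) is true, since both strings share tokens such as the formula C4H10, while exact_match(BUTANE, ISOBUTANE) is false.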

I was also wondering why chemistry.name (whose parent is content.title) in xmlindexer is turned into content (exactly that, not content.title) in strigicmd with the clucene backend.

A search by the content.version field returns no results, and querying a float molecular weight (chemistry.molecular_weight:58.1222) gives me no results either.

This leaves me with 3/20 tests failing.

InChI generator.
An InChI uniquely identifies a chemical structure. That is why it is a good idea to have InChIs for all the analyzed chemical files, where possible. OpenBabel can convert any recognised format to InChI strings. I made a working example to see whether it is easy and fast enough to generate InChIs in a Strigi streamanalyzer. It is called inchi-generator and it works for valid CML2 files only. I had to buffer the contents of the Strigi stream to pass it to the OpenBabel converter, but I feel there could be a more elegant solution: OpenBabel works with streams as well, they are just not compatible with Strigi streams.
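The buffering step looks roughly like this. It is a Python sketch of the idea only (the analyzer itself is C++): the convert callback stands in for the OpenBabel conversion, and the chunk size and the one-megabyte cap are assumptions I picked for the illustration.

```python
import io


def buffer_stream(stream, chunk_size=4096, max_bytes=1 << 20):
    # Read the stream chunk by chunk until it is exhausted.
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        # Assumed safety cap, so a huge file cannot eat all the memory.
        if total > max_bytes:
            raise ValueError("stream exceeds buffer limit")
        chunks.append(chunk)
    return b"".join(chunks)


def generate_inchi(stream, convert):
    # `convert` is a hypothetical callable wrapping the real converter;
    # it receives the complete file contents in one piece.
    return convert(buffer_stream(stream))
```

For example, generate_inchi(io.BytesIO(data), convert) hands the whole of data to convert in one call, which is exactly the part that feels inelegant for a stream-based indexer.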

Linking to OpenBabel.
I had very strange problems with unresolved symbols in OpenBabel format plugins until Geoffrey helped me. It's all about plugins! Strigi loads streamanalyzers with dlopen() on Linux, and so does libopenbabel when it needs a format plugin. The solution was simple: add RTLD_GLOBAL to the code which loads libopenbabel, so that its symbols become visible to the format plugins. Since libopenbabel is linked to inchi-generator, it gets loaded together with the analyzer, so RTLD_GLOBAL had to be added to the Strigi loader. I wonder if it can cause problems for other analyzers. Another solution would be to load libopenbabel from inchi-generator at runtime.
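The same flag is easy to try out from Python, because ctypes exposes the dlopen() mode. This is just a small demonstration of what the flag means, not the Strigi fix itself (which is in C++), and the helper name is my own.

```python
import ctypes


def load_with_global_symbols(path):
    # ctypes normally opens libraries with RTLD_LOCAL, which keeps their
    # symbols private. RTLD_GLOBAL makes the symbols available to any
    # library loaded afterwards, e.g. plugins that expect to find them.
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
```

Passing the file name of the shared library (on Linux something like libopenbabel.so; the exact name depends on the installation) loads it with its symbols exported, so that plugins opened afterwards can resolve against them.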

OpenBabel 2.1 (SVN) Debian packages
The FindOpenBabel2.cmake script by Carsten is used in KOpenBabel, Kalzium and now in kfile-chemical. It requires --atleast-version=2.1.0. In Debian unstable you can only find version 2.0.2. Michael Banck, the Debian maintainer, provides build rules in the debichem repository. I do not know why the new version is not available in sid. Probably it is related to the patches for better version abstraction, e.g. to allow two OpenBabel versions to be installed at the same time.
Anyway, as OB 2.1 is a requirement and can be packaged, I have put the x86 Debian libinchi, libopenbabel and openbabel packages here: http://neksa.net/debian/.

Strigi chemical fieldproperties.
Talking to Phreedom at the very beginning of my project, I thought that the chemical fieldproperties should represent a minimal set of metadata attributes. In practice, taking into account the variety of chemical formats, it is hard to define the list once and for all. That is why I have added a few more chemistry.fieldproperties. Among these are IUPAC Name, PubChem Compound ID, the experimental method of structure elucidation, some physicochemical properties which are, to my mind, the most queried in PDB, and a few more statistical counters. It is better to remove some unused properties later than to refrain from storing metadata which could have been extracted.

Further steps.
The test suite has to be expanded to cover all the formats currently existing in kfile-chemical. The analyzers need to be fixed to match the current Strigi ontology. This could be done this week.

OpenBabel integration requires more attention, since there is no "magic" MIME detection. I will try to employ Chemical MIME patterns to do the detection. InChI generation is only possible if we know the source format. Strigi is stream-based, hence we cannot look at the file extension in streamanalyzers.
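What I have in mind is content sniffing on the first bytes of the stream, roughly like the sketch below. The MIME type names come from Chemical MIME, but the regular expressions are rough guesses of mine for the illustration, not the actual Chemical MIME patterns.

```python
import re

# Hypothetical content patterns, checked in order against the stream header.
PATTERNS = [
    # A CML document: an XML element named cml or molecule, optionally
    # with a namespace prefix.
    ("chemical/x-cml", re.compile(rb"<(?:\w+:)?(?:cml|molecule)[\s>]")),
    # A PDB file: typical record names at the start of a line.
    ("chemical/x-pdb", re.compile(rb"^(?:HEADER|ATOM  |HETATM)", re.MULTILINE)),
]


def sniff_chemical_mime(header):
    # `header` is the first chunk of bytes read from the stream.
    for mime, pattern in PATTERNS:
        if pattern.search(header):
            return mime
    # Unknown format: nothing to pass to the converter.
    return None
```

With this, a header such as b'<molecule id="m1">' is reported as chemical/x-cml without ever looking at a file name, which is all a streamanalyzer can do.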

And of course, some eye candy: a GUI chemical search tool is one of my deliverables.

I would also love to spend some time on Strigi itself; perhaps Jos will find the kfile-chemical test suite good for testing the built-in analyzers.

July 18th -- 25th I will be attending the annual conference of the International Society for Computational Biology (ISCB) and the satellite meetings, this time in Vienna, Austria. If you happen to be there at the same time, please contact me.

19-20 : 3DSig Structural Bioinformatics and Computational Biophysics meeting
21 : 3rd ISCB Student Council Symposium
21-25 : ISMB/ECCB conferences

My submission has been accepted, so I will be presenting some of the results of my CUBIC project.

One piece of good news from CUBIC: thanks to the project, my final grade is now A = First Class Honours. I hope this will improve my chances of getting a nice place for a PhD.