May 27, 2007

Introduction

Hello world!

With this post I would like to start tracking the progress of my Google Summer of Code project. The idea of the project is to integrate chemistry and biology knowledge into the KDE desktop. Think of (bio)chemical metadata extraction, indexing and search, and this is where Strigi comes in.

Based on the powerful concept of Jstreams, Strigi is a high-performance desktop search engine that is now an integral part of KDE4. Strigi can use different backends (clucene, sqlite3, ...) and is built around the simple yet very powerful idea of pluggable stream analyzers. This architecture keeps the number of dependencies very small and places the emphasis on interfaces. The interface is of your choice: link directly, or use sockets, dbus or even command line utilities. The main Strigi developers, Jos van den Oever and Flavio Castelli, have already done a great job of providing a stable engine, and Strigi is now moving towards integration with the Nepomuk semantic desktop project and the Freedesktop.org Xesam specifications.

Nepomuk focuses on metadata ontologies and relations. Sebastian Trueg is the leader of the KDE-Nepomuk project, and there is also one GSoC student, Dmitriy Soloduhin, involved in it. Thanks to Phreedom (Evgeny Egorochkin), we now have Nepomuk ontologies in Strigi.

Xesam provides unified API specifications for search and metadata services. It is the result of a collaboration between Freedesktop.org and the Strigi, Beagle, Tracker, Pinot, Recoll and Nepomuk-KDE projects.

Now back to chemistry. Blue Obelisk fosters interaction between open projects dealing with chemical systems and cultivates standards such as InChI and CML. Blue Obelisk was born in the US at an ACS meeting, but has many of its roots in the group of Peter Murray-Rust at the University of Cambridge and at the University of Cologne (CUBIC). Christoph Steinbeck's group at CUBIC brought to life open projects such as CDK, Bioclipse and NMRShiftDB. I am happy that Egon Willighagen, who was a member of Steinbeck's group and is an active contributor to numerous open source projects, is now my mentor and supervisor in this GSoC project.

I was lucky to study Bioinformatics at CUBIC for the last year. I am very excited about my ion channel project, which is now over, and I hope to stay with the topic during my PhD studies. By the way, if you have any open PhD positions for bioinformaticians, please let me know.

It is hard to resist the temptation to tell you some interesting facts on ion channels, but returning to the main topic I should tell you about the key projects that are very important for my GSoC project. These are BODR, Chemical MIME, OpenBabel, InChI, CML, chemical structures, Avogadro and Kalzium.

BODR stands for Blue Obelisk Data Repository and is a shared repository of important chemoinformatics data.

Chemical MIME expands the list of standard MIME types with chemical file formats and provides example files for each format. Daniel Leidert maintains the chemical-mime-data database in Linux distributions. It conforms to David Faure's specifications for MIME type databases in KDE4, and the automagical type detection relies on it. The file extension alone is not enough to uniquely identify the MIME type: for example, ".sdf" stands for the SD chemical format and, at the same time, for a StarOffice Math Document.
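To illustrate why the content has to be inspected, here is a minimal Python sketch of content sniffing for a ".sdf" file. It is not how chemical-mime-data or the KDE4 MIME machinery actually works: the `guess_sdf_type` helper is hypothetical, and the heuristic (a plain-text file containing an `M  END` line and a `$$$$` record delimiter is treated as an MDL SD file) is an assumption made for illustration.

```python
from typing import Optional

SD_MIME = "chemical/x-mdl-sdfile"  # MIME type commonly used for MDL SD files


def guess_sdf_type(data: bytes) -> Optional[str]:
    """Return SD_MIME if the bytes look like an MDL SD file, else None.

    Heuristic only (an assumption for this sketch): SD files are plain
    text, end each molecule block with 'M  END' and each record with '$$$$'.
    """
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return None  # binary content, so not a plain-text SD file
    lines = text.splitlines()
    has_mol_end = any(line.startswith("M  END") for line in lines)
    has_record_end = any(line.strip() == "$$$$" for line in lines)
    return SD_MIME if (has_mol_end and has_record_end) else None
```

A caller would fall back to other detectors (or to the extension) when the function returns `None`.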

OpenBabel is both a library and a command line toolbox for manipulating chemical data in different formats. It fully supports Chemical MIME. Jerome Pansanel maintains the KOpenBabel wrapper (a Qt and KDE GUI for the OpenBabel converter) and also a large set of molecules in CML format, called Chemical Structures 2.0.1. This is also very important, because CML (Chemical Markup Language) is an XML-based chemical format that is intended to become the standard.

Some public databases, like PubChem and BODR Chemical Structures, provide the InChI identifier, which is an IUPAC standard. InChI makes it possible to represent a chemical structure in an unambiguous way. OpenBabel can generate InChIs from chemical structures. NCI and Kegg databases in CML with generated InChIs can be viewed at NCI and Kegg.

The BKchem chemical drawing program by Beda Kosata can regenerate structures from InChIs. Other interesting chemical drawing programs, which at the moment cannot import InChIs, are GChemPaint and Molsketch. BKchem uses Tk widgets, and GChemPaint is a part of the GNOME desktop. Molsketch by Harm van Eersel is a molecular drawing tool for KDE. If supplied as a KPart, Molsketch could have a bright future in different KDE4 applications.

Kalzium is a part of kdeedu. It started as a periodic table of the elements program by Carsten Niehaus and is now gaining momentum and attracting more hackers who want better chemistry support in KDE. Kalzium/Avogadro is a 3D molecular visualization library maintained by Benoît Jacob. It uses Eigen, a lightweight linear algebra C++ template library which is already a part of KDE4. Kalzium/Avogadro has acquired another GSoC student -- Marcus Hanwell. The leader of another interesting chemical KDE project, KryoMol, Armando Navarro Vázquez, recently sent a patch to separate the Kalzium Molecular Viewer as a KPart.

Kfile-chemical is a project started by Egon and later supported by Jerome and Daniel. Initially it was a set of kfile plugins that allowed chemical metadata extraction. But with the initiative to port kfile plugins to Strigi, kfile-chemical now provides Strigi with chemistry-aware stream analyzers. It is hosted in KDE SVN Playground, and since it aims to have a low number of dependencies, it has the potential to become a part of kdeedu, for example.

Since kfile-chemical is where I make my first efforts, I'll briefly describe what I am doing now and what you can expect by the end of the project.
  1. Make all kfile-chemical analyzers compatible with the Strigi/KDE/Nepomuk chemical ontology. This means that there are chemical field properties defined in Strigi, and during the metadata extraction process stream analyzers are supposed to fill in the relevant fields. The chemical field properties at the moment are: chemistry.inchi, chemistry.molecular_formula, chemistry.molecular_weight, chemistry.pdbid, chemistry.xray_resolution. Other properties are supposed to be stored in generic field properties, such as content.title and container.items;
  2. Generate an InChI (chemistry.inchi) for structures that do not already have one, using the OpenBabel library;
  3. Provide a test suite for the analyzers to make sure nothing breaks when one of the libraries in this mixture is updated;
  4. Expand the list of supported chemical file types to cover as many of Chemical MIMEs as possible.
If all goes smoothly, I will try to integrate OSCAR3 to process plain text and create InChIs for molecules found in that text. This will allow indexing and semantic linking of the literature and the chemical files.
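As a rough illustration of item 1, the sketch below shows an analyzer-like function that takes a molecular formula and returns values keyed by two of the chemistry.* field names listed there. It is plain Python, not the Strigi analyzer API: `fields_from_formula` and the tiny atomic-mass table are hypothetical stand-ins, and only simple formulas without brackets or charges are handled.

```python
import re

# Approximate standard atomic masses for a few elements (illustrative subset).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def fields_from_formula(formula: str) -> dict:
    """Map a simple formula such as 'C2H6O' to chemistry.* field values."""
    # Accept only a sequence of element symbols with optional counts.
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    weight = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_MASS:
            raise ValueError(f"no mass for element {symbol!r}")
        # A missing count means a single atom of that element.
        weight += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return {
        "chemistry.molecular_formula": formula,
        "chemistry.molecular_weight": round(weight, 3),
    }
```

In a real analyzer the element data would more plausibly come from a shared source such as BODR than from a hard-coded table.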

At our first meeting, Egon suggested that I should provide a KDE4 GUI chemical search tool, which could possibly be expanded to more generic purposes, like querying abstract field properties from the KDE-Nepomuk ontology. This is also a great way to test all the technologies and libraries involved. I won't bloat this post with mockups and screenshots, because it is already quite long, but I will certainly come back to it later this week. So, the idea is to have the following workflow implemented:
  1. While indexing, an InChI string is extracted or generated with the help of libOpenBabel by one of the kfile-chemical analyzers;
  2. The InChI string is stored in the chemistry.inchi field property in Strigi storage;
  3. It can be queried directly by issuing a "chemistry.inchi:" query in strigiclient;
  4. The GUI tool can use the Molsketch KPart to input the structure with the mouse. The structure is then converted to InChI using OpenBabel and used as a search key;
  5. The name of the compound, or a synonym, can be specified as a search key;
  6. The search query is sent to Strigi via dbus and the search results are received in response;
  7. Search results are either text documents or chemical structures. To visualize chemical structures, the Kalzium/Avogadro KPart can be used.
BTW, is it possible to do a substructure search using InChI, beyond simply ignoring some InChI layers?
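For the "ignoring some layers" part, here is a small Python sketch. An InChI string is a sequence of "/"-separated layers: the version, the formula, and then layers introduced by a prefix letter (for example c for connections, h for hydrogens, and t, m, s for stereochemistry). The helper names are hypothetical, and the sketch ignores complications such as the fixed-hydrogen layer, where prefixes can repeat. Note that this only gives a coarser exact match; it is not a substructure search.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI into its layers, keyed by prefix letter."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]  # the formula layer has no prefix
    for part in parts[2:]:
        if part:  # skip empty segments
            layers[part[0]] = part[1:]
    return layers


def same_ignoring(a: str, b: str, ignored=("b", "t", "m", "s")) -> bool:
    """Compare two InChIs after dropping the given layers (stereo by default)."""
    la, lb = split_inchi(a), split_inchi(b)
    for key in ignored:
        la.pop(key, None)
        lb.pop(key, None)
    return la == lb


# The two alanine enantiomers differ only in the stereo layers.
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
print(same_ignoring(L_ALA, D_ALA))              # stereo layers dropped
print(same_ignoring(L_ALA, D_ALA, ignored=()))  # full comparison
```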

This project is a good stress test for all Strigi technologies. But I also hope to be useful to Strigi by extending some functionality and writing unit test cases. And I am sure Jos won't let me go just like that :-) also because my primary affiliation here is KDE/Strigi.

I am happy to have this opportunity to work side-by-side with very skilled open source developers and to teach myself a good style, and of course, to have this project right on the intersection of my interests: open source, Linux, KDE and Bio(Chemo)Informatics.

The initial project proposal can be found here.

Now a few words about my current progress. While trying to get all the tools and libraries listed above working on my machine, I was surprised by a crash in KDE/Avogadro, which was caused by a bug in my radeon Mesa DRI drivers. Fortunately I was able to trace the problem and fix this annoying crash in the Avogadro OpenGL initialization. Then I switched to kfile-chemical and started adapting the CML stream analyzer to the current standard and ontology. To get my test cases done, I had to introduce the passing of filters to the command line strigicmd tool. While working on the tests, which involve all CML structures from Chemical Structures 2.0.1, I realized that because the current CML metadata extraction is not XML aware, I have to rewrite it using StreamSaxAnalyzer to make it work as desired. This is what I am doing at the moment.
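To show what "XML aware" buys, here is a minimal sketch of SAX-based InChI extraction from CML. It uses Python's standard xml.sax module rather than Strigi's StreamSaxAnalyzer, and it assumes that the InChI is carried by an `identifier` element whose `convention` attribute mentions "inchi", either in a `value` attribute or as element text. The class and function names are hypothetical.

```python
import xml.sax


class InchiHandler(xml.sax.ContentHandler):
    """Collect InChI strings from <identifier> elements of a CML document."""

    def __init__(self):
        super().__init__()
        self.inchis = []
        self._buffer = None  # collects text while inside an <identifier>

    def startElement(self, name, attrs):
        local = name.split(":")[-1]  # tolerate a prefix such as cml:
        convention = attrs.get("convention", "").lower()
        if local == "identifier" and "inchi" in convention:
            if "value" in attrs:
                self.inchis.append(attrs["value"])
            else:
                self._buffer = []  # the InChI is given as element text

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if self._buffer is not None and name.split(":")[-1] == "identifier":
            self.inchis.append("".join(self._buffer).strip())
            self._buffer = None


def extract_inchis(cml: bytes) -> list:
    """Parse CML bytes and return every InChI found."""
    handler = InchiHandler()
    xml.sax.parseString(cml, handler)
    return handler.inchis
```

Because the parser handles attributes split across lines, namespace prefixes and either quoting style, none of these variations needs special treatment in the extraction code, which is where a line-oriented text match tends to fail.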

My next step is to generate InChIs for CML files lacking the identifier. Then I will improve all other available chemical analyzers and create tests for them. Then I will run some performance tests involving a mirror of the PDB database. And, of course, I will start implementing the GUI chemical query tool.

2 comments:

Egon Willighagen said...

Looks good!

Egon Willighagen said...

Alexandr, as an easier-to-realize alternative you could consider using OSRA [1] to recognize molecules in drawings. Its dependencies are much easier to handle, and it is written in C++.

[1] http://chem-bla-ics.blogspot.com/2007/07/optical-chemical-structure-recognition.html