August 22, 2007

MDL SD file support

Chemical MDL SD files are now powered by advanced KDE/Strigi technologies.

Jstreams is a lightweight C++ streams library that gives us a powerful notion of substreams. Substream providers are very fast and feed Strigi analyzers with data.

One of the main goals of my GSoC project was to stress-test Strigi, and SD files are a good example of that. They can be really large containers of MOL molecules, so it is a natural idea to access them like normal folders with MOL files inside.

I will make a tutorial-like dissection of the implementation here to encourage others to implement support of their favorite file formats in a similar way.

SdfInputStream provides MOL entries as substreams. The important thing is to make sure it does not mistake any other format for SD. The SdfInputStreamTest test case checks some basic stream operations.
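To make the substream idea concrete, here is a minimal standalone sketch of cutting an SD file into MOL entries. It does not use the Jstreams API; `MolEntry`, `fallbackName`, `splitSdf` and `countMolecules` are hypothetical names of my own. It assumes that each record ends with a line containing only `$$$$` and that the first line of a record is its (optional) title. Unlike a real substream provider, which would hand out one entry at a time, it buffers everything in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One entry of an SD file (hypothetical type, not part of Strigi).
struct MolEntry {
    std::string name;  // title line, or "MoleculeN" when the title is blank
    std::string text;  // all lines of the record before the "$$$$" line
};

// "MoleculeN" substitute name for the N-th record (1-based).
std::string fallbackName(std::size_t n) {
    assert(n >= 1);
    return "Molecule" + std::to_string(n);
}

// Reads the stream and returns one MolEntry per "$$$$"-terminated record.
// Text that is never followed by "$$$$" (e.g. a plain MOL file) is ignored,
// so a non-SD input yields no entries.
std::vector<MolEntry> splitSdf(std::istream& in) {
    std::vector<MolEntry> entries;
    std::string line, title, text;
    bool firstLine = true;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (line == "$$$$") {
            const bool blank = title.find_first_not_of(" \t") == std::string::npos;
            MolEntry e;
            e.name = blank ? fallbackName(entries.size() + 1) : title;
            e.text = text;
            entries.push_back(e);
            title.clear();
            text.clear();
            firstLine = true;
            continue;
        }
        if (firstLine) {
            title = line;  // first line of a record is the molecule title
            firstLine = false;
        }
        text += line;
        text += '\n';
    }
    return entries;
}

// Convenience wrapper: number of complete records in an in-memory SD text.
std::size_t countMolecules(const std::string& sdfText) {
    std::istringstream in(sdfText);
    return splitSdf(in).size();
}
```

Calling `countMolecules` on a plain MOL file returns 0, which is the cheap version of "do not mistake any other format for SD".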

ArchiveReader is a facility which enables kio_jstream to represent files as directories and dive deep inside archives, email attachments and now SD files. ArchiveReader checks the stream header by calling one InputStream provider after another and, if one matches, it tries to recurse into the tree. This is a greedy approach and it hurts. There is probably some room for improvement.

A test case with a sample file (I used a 10-compound SD file) can give you a basic idea of whether your substream provider works, but it does not test it as thoroughly as ArchiveReader does. My major troubles were with ArchiveReader. Once they are solved, you can enjoy access through the KDE interface.

I took a large 500 MB SD file with ~250,000 compounds and some smaller files of 40, 75 and 150 MB. In the screenshots you can see the 40 MB file with ~11,000 compounds in Dolphin.



Note that not all of the entry information is propagated to KIO at the moment; for example, the size of the "directory" is missing. The interface also gradually slows down to an unusable state with larger and larger files (1, 2, 3, ... 20 million lines of text). It is probably not the best idea to put thousands of virtual files in one virtual folder. One possibility is to introduce virtual subfolders with at most 100 molecules each. Naming is also a problem, because the title is optional in MOL files; I used "MoleculeN" as a substitute name for molecule #N. Another nice test is to read the sub"files" in KWrite. The screenshots show a 10-compound SD file in the file open dialog and the Molecule2 file opened. Of course, all screenshots are from KDE4, running in a Xephyr session in my case.
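The virtual subfolder idea is not implemented, but the mapping itself would be simple. A possible sketch, where the function name `virtualPath` and the "first-last" folder naming are my own assumptions for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical layout: molecules 1..100 go to "1-100", 101..200 to "101-200", etc.
// n is the 1-based molecule number, perFolder the maximum entries per subfolder.
std::string virtualPath(std::size_t n, std::size_t perFolder = 100) {
    assert(n >= 1 && perFolder >= 1);
    const std::size_t first = ((n - 1) / perFolder) * perFolder + 1;
    const std::size_t last = first + perFolder - 1;
    return std::to_string(first) + "-" + std::to_string(last) +
           "/Molecule" + std::to_string(n);
}
```

With this scheme, molecule 101 would appear as `101-200/Molecule101`. The last subfolder name may promise more entries than the file contains, because the total number of molecules is not known until the whole file has been read.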



Now switching to data analysis.

SdfEndAnalyzer uses SdfInputStream to explore SD files, calls indexChild() for each molecule found, and stores the number of molecules in the chemistry.molecule_count field.

This is all done in Strigi itself, not in Strigi-chemical, because I had some trouble writing and using external substream providers. Hopefully this can be solved with the help of Jos. Since it does not add much overhead, it is not a problem.

indexChild() starts a new chain of analysis, and this is where MOL files are indexed. MdlMolFileLineAnalyzer is completely unaware of where the data stream comes from. Moreover, it does not even have direct access to the input stream; it only analyzes the text lines in sequential order. It currently detects the MOL signature, makes sure the input is not an SD file, and collects (and calculates) the chemical metadata. So far these are: chemistry.name, content.comment, chemistry.molecular_formula, chemistry.atom_count, chemistry.bond_count, chemistry.chirality.
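To show what line-by-line analysis of a MOL file can look like, here is a rough standalone sketch. It is not the MdlMolFileLineAnalyzer code and does not use the Strigi analyzer interfaces; `MolLineSketch` and its members are hypothetical names. It relies on the fixed-width V2000 layout: line 1 is the title, line 4 is the counts line (atom count in columns 1-3, bond count in columns 4-6, chiral flag in columns 13-15, followed by the "V2000" tag), and in each atom line the element symbol sits in columns 32-34. The formula ordering (carbon, then hydrogen, then the rest alphabetically) is an assumption on my side, and the SD check is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical line-fed collector for MOL V2000 metadata.
struct MolLineSketch {
    std::size_t lineNo = 0;
    bool isMol = false;  // set once the counts line carries the V2000 tag
    std::string name;
    int atomCount = -1, bondCount = -1, chirality = -1;
    std::map<std::string, int> elements;  // element symbol -> explicit atom count

    // Parses a 3-character, right-aligned integer field starting at pos.
    static int field(const std::string& s, std::size_t pos) {
        if (s.size() < pos + 3) return -1;
        return std::atoi(s.substr(pos, 3).c_str());
    }

    // Called once per text line, in order; never sees the underlying stream.
    void handleLine(const std::string& line) {
        ++lineNo;
        if (lineNo == 1) {
            name = line;
        } else if (lineNo == 4) {
            isMol = line.find("V2000") != std::string::npos;
            if (isMol) {
                atomCount = field(line, 0);
                bondCount = field(line, 3);
                chirality = field(line, 12);
            }
        } else if (isMol && lineNo > 4 && atomCount > 0 &&
                   lineNo <= 4 + static_cast<std::size_t>(atomCount)) {
            if (line.size() < 34) return;  // too short to be an atom line
            std::string sym = line.substr(31, 3);
            sym.erase(sym.find_last_not_of(' ') + 1);   // trim right
            sym.erase(0, sym.find_first_not_of(' '));   // trim left
            if (!sym.empty()) ++elements[sym];
        }
    }

    // Formula from explicit atoms only: C first, then H, then alphabetical.
    std::string formula() const {
        std::string out;
        auto emit = [&out](const std::string& sym, int n) {
            assert(n > 0);
            out += sym;
            if (n > 1) out += std::to_string(n);
        };
        const bool hasCarbon = elements.count("C") > 0;
        if (hasCarbon) {
            emit("C", elements.at("C"));
            auto h = elements.find("H");
            if (h != elements.end()) emit("H", h->second);
        }
        for (const auto& e : elements) {
            if (hasCarbon && (e.first == "C" || e.first == "H")) continue;
            emit(e.first, e.second);
        }
        return out;
    }
};
```

Because the collector only receives lines, the same object works whether the text comes from a standalone MOL file or from an entry inside an SD file.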

Xmlindexer is a handy command-line tool for checking the outcome. Here it is tested on a 10-compound SD file:

xmlindexer ligs3d.sdf

<?xml version='1.0' encoding='UTF-8'?>
<metadata>
<file uri='ligs3d.sdf/Molecule1' mtime='1187611088'>
<value name='system.size'>3609</value>
<value name='chemistry.name'>MFCD02681585</value>
<value name='chemistry.molecular_formula'>C28N4O4</value>
<value name='chemistry.atom_count'>36</value>
<value name='chemistry.bond_count'>39</value>
<value name='chemistry.chirality'>1</value>
<value name='system.depth'>1</value>
</file>
<file uri='ligs3d.sdf/Molecule2' mtime='1187611088'>
<value name='system.size'>3362</value>
<value name='chemistry.name'>FCD01567969</value>
<value name='chemistry.molecular_formula'>C28N3O2</value>
<value name='chemistry.atom_count'>33</value>
<value name='chemistry.bond_count'>37</value>
<value name='chemistry.chirality'>1</value>
<value name='system.depth'>1</value>
</file>

[:skip:]

<file uri='ligs3d.sdf' mtime='1187611088'>
<value name='system.size'>34150</value>
<value name='chemistry.molecule_count'>10</value>
<value name='system.depth'>0</value>
</file>
</metadata>


Test suites using the sample files from the Blue Obelisk Chemical Test File Repository will make sure the analyzers do not get broken in the future: SDFTestCase, MOLTestCase.
