August 22, 2007

Strigi now extracts chemical information from PNG files

Many people have blogged recently about storing molecular connectivity tables in images (Egon summarized it). Strigi-chemical can now extract and index this data.

This is how it works: PngChemicalEndAnalyzer is an endAnalyzer that takes control of the stream. It detects a chemical chunk in a PNG file (Molfile, CML, InChI, ...) and creates a substream, which it passes to indexChild(). The whole chain of analyzers then runs again on that substream, and the chemical data is extracted by the respective stream analyzer.
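
To make the chunk-detection step concrete, here is a minimal Python sketch of the idea. It is not the Strigi code (which is C++): the function names, the `looks_chemical` heuristic and the `InChI` keyword are my own assumptions for illustration. It walks the PNG chunks, picks the tEXt chunks whose text looks chemical, and yields each one as a separate piece of data, roughly what would be handed on as a substream.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(ctype, payload):
    """Serialize one PNG chunk: length, type, payload, CRC-32 over type+payload."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


def iter_chunks(data):
    """Yield (chunk_type, payload) for every chunk in a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG stream")
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 (length) + 4 (type) + payload + 4 (CRC)
        if ctype == b"IEND":
            break


def looks_chemical(text):
    """Hypothetical heuristic; the real analyzer may detect formats differently."""
    return (text.startswith(b"InChI=")   # InChI identifier
            or b"V2000" in text          # Molfile counts line
            or b"<cml" in text           # CML document
            or b"<molecule" in text)


def chemical_substreams(data):
    """Yield (keyword, text) for each tEXt chunk that looks chemical."""
    for ctype, payload in iter_chunks(data):
        if ctype != b"tEXt":
            continue
        keyword, sep, text = payload.partition(b"\x00")
        if sep and looks_chemical(text):
            yield keyword.decode("latin-1"), text


# Demo stream: chunk layout only (no IDAT), enough to exercise the walker.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
inchi = b"InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
demo_png = (PNG_SIGNATURE
            + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"tEXt", b"Author\x00Jean Brefort")
            + make_chunk(b"tEXt", b"InChI\x00" + inchi)  # keyword is an assumption
            + make_chunk(b"IEND", b""))

found = list(chemical_substreams(demo_png))
for keyword, text in found:
    print(keyword, text.decode("latin-1"))
```

Only the second tEXt chunk is yielded; the Author chunk is left to the ordinary image analyzer.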

It does not replace the normal PNG endAnalyzer, which is in charge of extracting all image-related information from the stream.

By the way, the InChI analyzer was upgraded and can now detect InChIs in various text sources. It can also fix spaces and, in some cases, even line breaks.
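
As a rough illustration of that kind of repair, here is a small Python sketch. It is only a heuristic of my own, not the analyzer's actual algorithm: it assumes that stray whitespace sits next to a `/` layer separator, and the regular expression and the `find_inchis` name are hypothetical.

```python
import re

# "InChI=1" (optionally "1S"), then one or more layers. Whitespace, including
# line breaks, is tolerated only around the "/" that starts each layer.
INCHI_RE = re.compile(r"InChI=1S?(?:\s*/\s*[A-Za-z0-9.,;()*+\-?]+)+")


def find_inchis(text):
    """Return InChI-like strings found in text, with inner whitespace removed."""
    return [re.sub(r"\s+", "", match.group(0)) for match in INCHI_RE.finditer(text)]


sample = ("Caffeine: InChI=1/C8H10N4O2/ c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2\n"
          "/h4H,1-3H3 (thanks Jean)")
print(find_inchis(sample))
```

A break in the middle of a layer is not handled by this sketch, and text such as "/ word" directly after an InChI would be swallowed, so a real implementation needs stricter layer rules.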

The PNG chemical analyzer has a test case. Let's have a look at the sample files and the xmlindexer output:

Caffeine with embedded InChI (thanks Jean):



<?xml version='1.0' encoding='UTF-8'?>
<metadata>
<file uri='caffeine.png/Molecule1' mtime='1187122308'>
<value name='system.size'>66</value>
<value name='content.version'>1</value>
<value name='chemistry.inchi'>InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3</value>
<value name='chemistry.molecule_count'>1</value>
<value name='system.depth'>1</value>
<text>InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
</text>
</file>
<file uri='caffeine.png' mtime='1187122308'>
<value name='content.mime_type'>image/png</value>
<value name='system.size'>3323</value>
<value name='chemistry.molecule_count'>1</value>
<value name='content.author'>Jean Brefort</value>
<value name='image.height'>171</value>
<value name='image.width'>193</value>
<value name='image.color_depth'>32</value>
<value name='image.color_space'>RGB/Alpha</value>
<value name='compressed.compression_algorithm'>Deflate</value>
<value name='image.interlace'>None</value>
<value name='content.copyright'>Public domain</value>
<value name='system.depth'>0</value>
</file>
</metadata>

Rosiglitazone with Molfile (thanks Rich):



<?xml version='1.0' encoding='UTF-8'?>
<metadata>
<file uri='rosiglitazone.png/Molecule1' mtime='1185970696'>
<value name='system.size'>2411</value>
<value name='chemistry.name'>name</value>
<value name='chemistry.molecular_formula'>C18N3O3S1</value>
<value name='chemistry.atom_count'>25</value>
<value name='chemistry.bond_count'>27</value>
<value name='content.comment'>comments</value>
<value name='chemistry.chirality'>0</value>
<value name='system.depth'>1</value>
</file>
<file uri='rosiglitazone.png' mtime='1185970696'>
<value name='content.mime_type'>image/png</value>
<value name='system.size'>7984</value>
<value name='chemistry.molecule_count'>1</value>
<value name='image.height'>109</value>
<value name='image.width'>327</value>
<value name='image.color_depth'>32</value>
<value name='image.color_space'>RGB/Alpha</value>
<value name='compressed.compression_algorithm'>Deflate</value>
<value name='image.interlace'>None</value>
<value name='system.depth'>0</value>
</file>

</metadata>

The test files have been deposited in the Blue Obelisk CTFR.

2 comments:

Anonymous said...

I am so confused.

How can you extract chemical information from a png image file?

Alexandr Goncearenco said...

It is *not* optical recognition. Please read my post on OSRA.

I can only extract what *is already there*. One of the samples from my post has an InChI identifier embedded in the PNG; the other has an embedded Molfile entry.

The PNG format allows chunks, like tEXt, to serve as containers for any kind of data you may wish to hold there.
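
To show what such a container looks like, here is a minimal Python sketch that puts an InChI into a PNG as a tEXt chunk. It is an illustration only: the `embed_text` helper and the `InChI` keyword are my own assumptions, not necessarily what the tools that produced the sample files do.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(ctype, payload):
    """Serialize one PNG chunk: length, type, payload, CRC-32 over type+payload."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


def embed_text(png, keyword, text):
    """Return a copy of png with a tEXt chunk inserted just before IEND."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG stream")
    iend = make_chunk(b"IEND", b"")
    if not png.endswith(iend):
        raise ValueError("expected IEND as the final chunk")
    # tEXt payload: Latin-1 keyword, a NUL separator, then Latin-1 text.
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png[:-len(iend)] + make_chunk(b"tEXt", payload) + iend


# A 1x1 8-bit grayscale image: one scanline = filter byte 0 + one pixel byte.
tiny_png = (PNG_SIGNATURE
            + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
            + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
            + make_chunk(b"IEND", b""))

inchi = "InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
tagged = embed_text(tiny_png, "InChI", inchi)  # keyword is an assumption
```

Image viewers skip the extra chunk, while an indexer that knows where to look can read the identifier back without touching the pixels.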

If you want the details, just follow the links from the post to check out Egon's overview and the source code of the analyzers.