August 11, 2007

Optical Structure Recognition in Strigi-chemical?

More people have blogged about OSRA, the GPL-licensed Optical Structure Recognition tool (1, 2, 3, 4), since its first release. OSRA is a young project, and although its recognition quality is poor at the moment, its open license promises bright prospects.

Journal articles, patent documents, textbooks, etc. represent chemical structures as graphics. Having an OCR analyzer in Strigi-chemical that extracts chemical structures from graphical files is a natural idea, but even with OSRA a decent implementation is a long way off. There are several kinds of obstacles.

OSRA deployment

At the moment, building an OSRA binary takes some effort. It has no automake/autoconf or CMake build system and a long list of compile-time and runtime dependencies; ImageMagick, POTRACE, GOCR and OpenBabel are the major ones. Since fixing this is outside the scope of my GSoC project, I would have to wait for the upstream maintainer to ship OSRA as a library with an API. On my side, I can make a strigi-chemical OSRA-helper (described before) with an optional runtime dependency, but it would take some time to figure out an API first.
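To illustrate what an optional runtime dependency could look like, here is a minimal Python sketch. It is not the strigi-chemical helper itself. The function names are hypothetical, and the way the `osra` binary is called (input file as the only argument, one structure per output line) is an assumption, not a documented interface.

```python
import shutil
import subprocess
from typing import Callable, List, Optional


def find_osra(which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Return the path of an `osra` executable on PATH, or None if absent."""
    return which("osra")


def recognize(image_path: str, osra_path: Optional[str]) -> List[str]:
    """Run OSRA on one file and return its non-empty output lines.

    Returns an empty list when OSRA is not installed, so that indexing
    can continue without structure recognition.
    """
    if osra_path is None:
        return []
    # Assumption: the binary takes the input file as its only argument
    # and prints one recognized structure per line on stdout.
    result = subprocess.run([osra_path, image_path],
                            capture_output=True, text=True, check=False)
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

With this shape the indexer never fails because OSRA is missing; it simply gets no structures back.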


I did some benchmarks with OSRA. It takes 1'30'' to process the sample patent document.

The general overview of the OCR workflow is as follows:
  • it uses ImageMagick to detect the file type
  • PDF and PS files are rendered as images
  • the resolution is set; for PDF/PS files it is fixed at 150 dpi
  • it iterates over the pages and detects minimal boxes which most probably contain molecular structures. Here is the first box from the sample patent document:

  • the box is traced to obtain a vector representation
  • atoms, chars, fixed chars are detected
  • bonds are fixed, broken bonds are removed
  • valency-check is performed
  • the structure is converted to SMILES (could be an InChI though):
InChI: InChI=1/C16H30O5/c1-15(2,13(17)18)9-5-7-11-21-12-8-6-10-16(3,4)14(19)20/h5-12H2,1-4H3,(H,17,18)(H,19,20)

  • continue with the next box
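The valency check in the list above can be pictured with a small Python sketch. This is not OSRA's actual code. The valence table is a deliberately simplified assumption, and the atom/bond representation is hypothetical.

```python
from collections import defaultdict

# Simplified maximum valences; an assumption for illustration only.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def overvalent_atoms(atoms, bonds):
    """Return indices of atoms whose summed bond order exceeds the table value.

    atoms: list of element symbols, e.g. ["C", "O"]
    bonds: list of (i, j, order) tuples referring to indices in `atoms`
    """
    total = defaultdict(int)
    for i, j, order in bonds:
        total[i] += order
        total[j] += order
    return [idx for idx, symbol in enumerate(atoms)
            if symbol in MAX_VALENCE and total[idx] > MAX_VALENCE[symbol]]


# A carbonyl carbon with one more single-bonded oxygen passes the check.
acid_fragment = overvalent_atoms(["C", "O", "O"], [(0, 1, 2), (0, 2, 1)])

# An oxygen traced with three single bonds is flagged as suspicious.
bad_oxygen = overvalent_atoms(["O", "C", "C", "C"],
                              [(0, 1, 1), (0, 2, 1), (0, 3, 1)])
```

A flagged atom usually means that a bond or a character was misrecognized during tracing, so the structure should be corrected or discarded.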
I tried to see how it scales and created a PDF with 64 compounds. OSRA detected 60 boxes in 1'5", unfortunately with quite a high error rate. This benchmark shows that OSRA does not handle rotated (landscape) pages well, that recognizing 9 compounds can take much longer than recognizing 60, and that the recognition quality is low.

A simple example from the PDF with 64 compounds; the second image is rendered by PubChem from the SMILES that OSRA produced:

Jos suggested that we take the images from a PDF as substreams. There is a substreamprovider for that in Strigi.

I think that, given the CPU time required to process a document, we should carefully control and select what we pass to the OSRA-helper.

We could do it like this, for example:
  • create a chemical image analyzer, which would first check whether structural information is already embedded in the image itself (yes, it is possible, more in my next blogpost),
  • then it should carefully check the context to see whether this is a chemical paper (looking at the DOI, for example) and
  • only then pass the extracted substream to the OSRA-helper.
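The three checks above could be sketched in Python as follows. Everything here is hypothetical: the `ImageSubstream` class, the field names and the `next_step` function are made up for illustration and are not Strigi API. The DOI prefix list is only an example (10.1021 is the prefix used by ACS journals).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageSubstream:
    """Hypothetical description of an image extracted from a document."""
    name: str
    embedded_structure: Optional[str] = None  # structure data found inside the image
    document_doi: Optional[str] = None        # DOI of the enclosing document, if known


# Illustrative list of DOI prefixes treated as "chemical" publishers.
CHEMICAL_DOI_PREFIXES = ("10.1021/",)


def next_step(image: ImageSubstream) -> str:
    """Decide what to do with an image: 'use-embedded', 'run-osra' or 'skip'."""
    # 1. Embedded structural information makes the expensive OCR unnecessary.
    if image.embedded_structure:
        return "use-embedded"
    # 2. Only documents that look like chemical papers are worth the CPU time.
    doi = image.document_doi or ""
    if doi.startswith(CHEMICAL_DOI_PREFIXES):
        return "run-osra"
    # 3. Everything else is not passed to the OSRA-helper.
    return "skip"
```

The order matters: the cheap checks come first, so the slow recognition step only runs for images that pass both of them.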

