Journal articles, patent documents, textbooks, etc. represent chemical structures as graphics. Having an OCR analyzer in Strigi-chemical that extracts chemical structures from graphical files is a natural idea, but even with OSRA a decent implementation is a long way off. There are obstacles of several kinds.
At the moment, building an OSRA binary takes some effort. It has no automake/autoconf or CMake build system, and it has a long list of compile-time and runtime dependencies; ImageMagick, POTRACE, GOCR and OpenBabel are the major ones. Since this is outside the scope of my GSoC, I would have to wait for the upstream maintainer to ship OSRA as a library with an API. From my side, I can make a strigi-chemical OSRA-helper (described before) as an optional runtime dependency, but it would take some time to figure out an API first.
I did some benchmarks with OSRA. It takes 1'30'' to process the sample patent document:
The general overview of the OCR workflow is as follows:
- it uses ImageMagick to detect the file type
- PDF and PS files are rendered as images
- the resolution is set; for PDF/PS files it is fixed at 150 dpi
- it iterates over the pages and detects the minimal boxes that most probably contain molecular structures. Here is the first box from the sample patent document:
- the box is traced to obtain a vector representation
- atoms, chars, fixed chars are detected
- bonds are fixed, broken bonds are removed
- valency-check is performed
- the structure is converted to SMILES (could be an InChI though):
- continue with the next box
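To make the valency-check step in the list above more concrete, here is a minimal Python sketch. It is not OSRA's actual code: the `MAX_VALENCE` table, the `valence_violations` function and the atom/bond representation are all assumptions made for illustration, and the table ignores charges and hypervalent atoms.

```python
# Usual maximum valences for a few common elements.
# Illustrative only: this is not OSRA's table and it ignores formal charges.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1}


def valence_violations(atoms, bonds):
    """Return (index, symbol, used_valence) for every atom that has more
    bonds than its usual maximum valence allows.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, order) tuples referring to indices in `atoms`
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        # Each bond consumes `order` valence units on both of its atoms.
        used[i] += order
        used[j] += order
    return [
        (idx, symbol, used[idx])
        for idx, symbol in enumerate(atoms)
        if symbol in MAX_VALENCE and used[idx] > MAX_VALENCE[symbol]
    ]


# Heavy atoms of ethanol (C-C-O): no atom exceeds its valence.
ethanol = valence_violations(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])

# A plausible misread: an oxygen with three single bonds gets flagged.
misread = valence_violations(
    ["C", "O", "C", "C"], [(0, 1, 1), (1, 2, 1), (1, 3, 1)]
)
```

A recognizer could use such a list of flagged atoms to reject a box, or to retry the bond detection, before anything is converted to SMILES.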
A simple example from a PDF with 64 compounds; the second one is rendered by PubChem from the produced SMILES:
Jos suggested that we take the images from the PDF as substreams. There is a substreamprovider for that in Strigi.
Given the CPU time required to process a document, I think we should carefully control and select what we pass to the OSRA-helper.
We could do it like this, for example:
- Create a chemical image analyzer, which would first check whether structural information is already embedded in the image itself (yes, it is possible; more in my next blog post),
- then it should carefully check the context to see whether it is a chemical paper (looking at the DOI, for example), and
- then it can pass the extracted substream to the OSRA-helper.
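The gating logic from the list above could look roughly like the following Python sketch. Everything in it is hypothetical: `should_run_ocr`, `looks_like_chemical_paper` and the prefix set are names invented here and are not part of Strigi or OSRA. The DOI prefixes 10.1021 (ACS) and 10.1039 (RSC) only illustrate what "looking at the DOI" might mean; a real analyzer would need a longer list or a different heuristic.

```python
import re

# Illustrative DOI prefixes of chemistry publishers: 10.1021 (ACS), 10.1039 (RSC).
# Assumption: a real analyzer would use a much longer list.
CHEMISTRY_DOI_PREFIXES = {"10.1021", "10.1039"}

# Matches a DOI such as 10.1021/xyz and captures the registrant prefix.
DOI_PATTERN = re.compile(r"\b(10\.\d{4,9})/\S+")


def looks_like_chemical_paper(context_text):
    """Context check: does any DOI found in the text carry a known prefix?"""
    return any(
        match.group(1) in CHEMISTRY_DOI_PREFIXES
        for match in DOI_PATTERN.finditer(context_text)
    )


def should_run_ocr(has_embedded_structure, context_text):
    """Decide whether an extracted image substream is worth the OCR time."""
    if has_embedded_structure:
        # The structure can be read directly from the image, so OCR is not needed.
        return False
    return looks_like_chemical_paper(context_text)


# The DOI suffixes below are made up for the example.
run_for_chem_paper = should_run_ocr(False, "doi:10.1021/example-suffix")
run_for_other_paper = should_run_ocr(False, "doi:10.1000/example-suffix")
run_with_embedded = should_run_ocr(True, "doi:10.1021/example-suffix")
```

Doing the cheap checks first means the expensive OCR run only starts for images that pass both of them.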