Journal articles, patent documents, textbooks, etc. represent chemical structures as graphics. Having an OCR analyzer in Strigi-chemical that extracts chemical structures from graphical files is a natural idea, but even with OSRA it is a long way to a decent implementation. There are obstacles of different kinds.
OSRA deployment
At the moment, building an OSRA binary takes some effort. It has no automake/autoconf or CMake build system, and it has a long list of compile-time and runtime dependencies; ImageMagick, POTRACE, GOCR and OpenBabel are the major ones. Since this is out of the scope of my GSoC, I would have to wait for the upstream maintainer to ship OSRA as a library with an API. From my side, I can make a strigi-chemical OSRA-helper (described before) with an optional runtime dependency, but it would take some time to figure out an API first.
Performance
I did some benchmarks with OSRA. It takes 1'30'' to process the sample patent document:
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg63WyDwZxolz8O0Zj6eFp0qWYZwIyldZ5gfu7X5jMNmcZtULCuJvMqOoEDStdZIKjN-xi_e1_ith9y9z0Hh9l0HiYFj5D5sTRPwxe_L32dHzfpnGVxJLWTcgY5uBBFB0trN5X58DT-fw7y/s320/patent.gif)
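A measurement like this can be reproduced with a small wall-clock timer around the external command. The following is only a sketch: it assumes the `osra` binary is on the `PATH` and accepts the input file as a positional argument, and the file name `sample.pdf` in the comment is a placeholder, not the actual patent document.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    # Output is discarded; only the elapsed time matters here.
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def format_minutes_seconds(seconds):
    """Format a duration in minutes'seconds'' notation."""
    minutes, rest = divmod(round(seconds), 60)
    return f"{minutes}'{rest:02d}''"


if __name__ == "__main__":
    # Assumption: the command to time is passed on the command line,
    # for example: python bench.py osra sample.pdf
    # Without arguments, a no-op Python process is timed instead.
    command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    print(format_minutes_seconds(time_command(command)))
```

Wall-clock time includes rendering the PDF pages to images, so it measures the whole workflow rather than the recognition step alone.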
The general overview of the OCR workflow is as follows:
- it uses ImageMagick to detect the file type
- PDF and PS files are rendered as images
- the resolution is set; it is fixed (150 dpi) for PDF/PS files
- it iterates over the pages and detects minimal boxes which most probably contain molecular structures. Here is the first box from the sample patent document:
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo2yoYWA8PsisRTeHlAVRJWLhy2SjHNcoILZiltFNmLzABG9sQb0CdzXHp_cjUStKsxEDJL_z_okIp-T565wwor4282mKbi00nM4pLXrHZfTFSdUqNoHJbkYuB1wCwF6YC8bZYgn7_PJDJ/s320/patent0.png)
- the box is traced to obtain a vector representation
- atoms, chars, fixed chars are detected
- bonds are fixed, broken bonds are removed
- a valency check is performed
- the structure is converted to SMILES (or to an InChI, as here):
InChI: InChI=1/C16H30O5/c1-15(2,13(17)18)9-5-7-11-21-12-8-6-10-16(3,4)14(19)20/h5-12H2,1-4H3,(H,17,18)(H,19,20)
- continue with the next box
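The InChI string shown for the first box is made of layers separated by `/`: a version, a molecular formula, and then layers prefixed by a letter (`c` for connectivity, `h` for hydrogens). A minimal sketch of how an analyzer could pick such a string apart is shown below. The function names are made up for illustration and are not part of OSRA or Strigi, and the formula parser only handles simple formulas like this one.

```python
import re

# The InChI produced for the first box of the sample patent document.
INCHI = (
    "InChI=1/C16H30O5/c1-15(2,13(17)18)9-5-7-11-21-12-8-6-10-16(3,4)14(19)20"
    "/h5-12H2,1-4H3,(H,17,18)(H,19,20)"
)


def split_inchi(inchi):
    """Split an InChI string into its version, formula and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len("InChI="):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        # The first character names the layer, e.g. "c" or "h".
        layers[layer[0]] = layer[1:]
    return layers


def element_counts(formula):
    """Count atoms per element in a simple formula such as C16H30O5."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


if __name__ == "__main__":
    layers = split_inchi(INCHI)
    print(layers["formula"])
    print(element_counts(layers["formula"]))
```

Values like the molecular formula are the kind of field an indexer could store for searching, alongside the full identifier.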
A simple example from a PDF with 64 compounds; the second image is rendered by PubChem from the produced SMILES:
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9pcE1lfoQnz3-wxJ8glHdN-eEeMadckBHT5GQoUWtJFFGZels9XPOEzcKhNvtGDFxf3CrJE_lpKMDP7qetF9TJPiQir5zRV071iQ4mCRBg_HM8Dw3WuGJg9ohePZ9LGDWxBflSm9yiIij/s320/lig_latex16.png)
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhV76XQolSPdQc-Ke8IH-fHcih2xQK7Z8Cs6HNian_pJciX3ryv_0RnK7xcN5DMvXOg-rPyf1NpEPysBVOIRCn-nv0u0EewJdfFyIGX35F0ma3V9tf2FphLCS56IB3jeGWPpH4kZQw1MLf9/s320/kcnq1_pubchem16.png)
Jos suggested that we take images from PDF as substreams. There is a substreamprovider for that in Strigi.
I think that, taking into account the CPU time required to process a document, we should carefully control and select what we pass to OSRA-helper.
We could do it like this, for example:
- Create a chemical image analyzer, which would first check whether structural information is already embedded in the image itself (yes, it is possible; more in my next blog post),
- then it should carefully check the context, whether it is a chemical paper (looking at DOI, for example) and
- then it can pass the extracted substream to OSRA-helper.
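The three steps above can be sketched as a cheap gate in front of the expensive OCR call. Everything here is an assumption made for illustration: the `ImageSubstream` type, the metadata keys and the list of DOI prefixes are hypothetical and not part of Strigi or OSRA. The two prefixes used are those of ACS (`10.1021`) and RSC (`10.1039`); a real analyzer would need a better context check.

```python
import re
from dataclasses import dataclass, field

# Rough DOI matcher: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+")

# Hypothetical allow-list of publisher prefixes treated as "chemical".
CHEMISTRY_DOI_PREFIXES = ("10.1021/", "10.1039/")

# Hypothetical metadata keys under which a structure might be embedded.
EMBEDDED_STRUCTURE_KEYS = ("inchi", "smiles", "molfile")


@dataclass
class ImageSubstream:
    """An image extracted from a document, plus what is known about it."""
    data: bytes
    metadata: dict = field(default_factory=dict)
    context_text: str = ""


def has_embedded_structure(image):
    """Step 1: is structural information already stored in the image?"""
    return any(key in image.metadata for key in EMBEDDED_STRUCTURE_KEYS)


def looks_like_chemistry(context_text):
    """Step 2: does the surrounding document carry a chemistry DOI?"""
    return any(
        match.group(0).startswith(CHEMISTRY_DOI_PREFIXES)
        for match in DOI_PATTERN.finditer(context_text)
    )


def should_run_osra(image):
    """Step 3: pass the substream on only if OCR is needed and plausible."""
    if has_embedded_structure(image):
        return False  # the structure can be read directly, no OCR needed
    return looks_like_chemistry(image.context_text)
```

Both checks are much cheaper than an OCR run, so ordering them before the OSRA-helper call keeps the CPU cost limited to images that are likely to contain structures.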