August 10, 2007

Strigi plugins, tokenizers and ontology

This is a belated post on changes I made in Strigi at the beginning of July.
The tests in Strigi-chemical highlighted several problems in Strigi. Below I list each problem and its solution.

Problem with plugins.

Strigi analyzers are loaded as plugins with dlopen(). I noticed random and strange effects when multiple copies of one analyzer had been initialized. The explanation was simple: never use the RTLD_GLOBAL flag to load Strigi analyzers. A while ago I blogged about linking to OpenBabel and enabled this flag in Strigi; the solution now was to implement a loader for libopenbabel inside the strigi-chemical analyzer instead. LibOpenBabel is a highly recommended runtime dependency, although the chemical analyzers can work without it. A question to packagers: is it better to mark OpenBabel as a hard dependency, or only as suggested/recommended? There is also the option of a metapackage that pulls it in as a dependency. We don't want users to miss key features just because they overlooked a soft dependency in apt, for example.
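Back to the loading itself: here is a minimal sketch of how an analyzer plugin is opened with RTLD_LOCAL, assuming a hypothetical plugin file name and factory symbol (the real names in the Strigi loader differ):

#include <dlfcn.h>
#include <cstdio>

int main() {
    // Load an analyzer plugin with RTLD_LOCAL: its symbols stay private to
    // this handle, so identically named factory symbols exported by several
    // plugins (or several copies of the same plugin) cannot shadow each other.
    // With RTLD_GLOBAL that shadowing is exactly how factories get mixed up.
    void* plugin = dlopen("./strigi_chemical_plugin.so", RTLD_NOW | RTLD_LOCAL);
    if (!plugin) {
        std::fprintf(stderr, "could not load plugin: %s\n", dlerror());
        return 1;
    }

    // Look up the factory entry point (hypothetical symbol name).
    void* factory = dlsym(plugin, "createAnalyzerFactory");
    if (!factory) {
        std::fprintf(stderr, "no factory symbol: %s\n", dlerror());
    }
    return 0;
}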

Later, the OB loader was wrapped in a singleton with mutex locking. I had random core dumps in the unit tests due to thread-safety violations. I had no idea what could cause them inside OpenBabel, so I simply protected the instance.
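A rough sketch of such a guarded loader, using pthreads for the lock; the class and library names here are illustrative, not the actual strigi-chemical code:

#include <dlfcn.h>
#include <pthread.h>

// Illustrative only: a process-wide OpenBabel loader. The mutex makes sure
// that only one thread opens libopenbabel and that all later callers see
// the same handle. Note RTLD_GLOBAL: the library is loaded by the analyzer
// at runtime and will not work with RTLD_LOCAL (see the comments below).
class OpenBabelLoader {
public:
    // Returns the shared libopenbabel handle, or 0 if the library is not
    // installed (it is only a recommended, not a required, dependency).
    static void* handle() {
        pthread_mutex_lock(&mutex_);
        if (!tried_) {
            tried_ = true;
            handle_ = dlopen("libopenbabel.so", RTLD_NOW | RTLD_GLOBAL);
        }
        void* h = handle_;
        pthread_mutex_unlock(&mutex_);
        return h;   // kept open for the lifetime of the process
    }
private:
    static pthread_mutex_t mutex_;
    static void* handle_;
    static bool tried_;
};

pthread_mutex_t OpenBabelLoader::mutex_ = PTHREAD_MUTEX_INITIALIZER;
void* OpenBabelLoader::handle_ = 0;
bool OpenBabelLoader::tried_ = false;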

Problem with tokenizers.

One of the key features of the chemical analyzers is an exact structure match search by chemical identifiers. The recommended IUPAC identifier is InChI, although it is not as widespread as SMILES, for example. The power of InChI is that it represents a chemical structure, with all layers of information (such as charges, fixed hydrogens, isotopes and stereochemistry), as a single string, and the reverse transformation is possible.
The problem here was that an InChI string was tokenized, i.e. cut into pieces. A few words about how this works. Each token represents a "word" in the dictionary of the search engine's backend. Strigi has an indexreader/indexwriter abstraction over the search backends; it can even support a hybrid backend, e.g. a mixture of clucene and sqlite. The clucene backend is well supported, while support for the other backends is still rudimentary (developers welcome!). So each field is processed by an analyzer/tokenizer during indexing, and the same tokenizer has to be used to process the field during search query analysis. Some tokens, like InChI, deserve special treatment.

Have a look at the caffeine InChI to get an impression of what I'm talking about:

InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
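To see why tokenizing it is a problem, here is a rough sketch of what a generic word tokenizer (splitting on anything that is not a letter or a digit, which is roughly how a standard text tokenizer behaves) makes of that string; none of the resulting fragments is usable for an exact structure match:

#include <cctype>
#include <iostream>
#include <string>

int main() {
    std::string inchi =
        "InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3";

    // Roughly what a generic word tokenizer does: split on anything that is
    // not a letter or digit. The identifier falls apart into meaningless
    // fragments such as "InChI", "1", "C8H10N4O2", "c1", "10", ...
    std::string token;
    for (std::string::size_type i = 0; i <= inchi.size(); ++i) {
        if (i < inchi.size() && std::isalnum(static_cast<unsigned char>(inchi[i]))) {
            token += inchi[i];
        } else if (!token.empty()) {
            std::cout << token << '\n';
            token.clear();
        }
    }
    // With the untokenized treatment the whole string is stored as a single
    // term, so an exact structure match query can find it again.
    return 0;
}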

The solution was to add a special flag to the chemistry.inchi ontology field property, indicating that a special tokenizer is required. I added index control flags that tune the behavior of fields in the index (at the moment supported only in the clucene backend). These flags are boolean: Binary, Compressed, Indexed, Stored, Tokenized. By default, Stored|Indexed|Tokenized are enabled.
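Expressed as a bitmask, the idea looks roughly like this (the enum below is only an illustration, not the actual Strigi or clucene API):

// Illustration only: the five boolean index control flags as a bitmask.
enum IndexFlag {
    Binary     = 1 << 0,  // store the raw bytes instead of text
    Compressed = 1 << 1,  // compress the stored value
    Indexed    = 1 << 2,  // make the field searchable
    Stored     = 1 << 3,  // keep the original value in the index
    Tokenized  = 1 << 4   // run the value through a tokenizer
};

// Default behaviour of a field.
const int defaultFlags = Stored | Indexed | Tokenized;

// What chemistry.inchi needs: searchable and stored, but *not* cut into
// tokens, so that the identifier survives as a single term.
const int inchiFlags = Stored | Indexed;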

Ontology database.

This would not be worth a blog post if it were that simple. Current Strigi uses fieldproperties files as a draft ontology database. The registerField() API did not respect the database and passed cardinality and child-parent relations as call parameters. To make my index control flags work, this behavior was changed, and the values are now loaded from the database. This leaves the registerField() API call with only one parameter: the field name. Loading and enforcement of MinCardinality and MaxCardinality from the database were implemented as well.
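In rough terms the API change looks like this (parameter names and types are placeholders, not the exact Strigi signatures):

#include <string>

class RegisteredField;  // opaque here; only the signatures matter

// Before: cardinality and the parent relation came in as call parameters,
// bypassing whatever the fieldproperties database said.
const RegisteredField* registerField(const std::string& fieldName,
                                     const std::string& type,
                                     int maxOccurs,
                                     const RegisteredField* parent);

// After: only the field name is passed; the type, MinCardinality,
// MaxCardinality, parent-child relations and the index control flags
// are all looked up in the ontology database instead.
const RegisteredField* registerField(const std::string& fieldName);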

Why is the fieldproperties database obsolete? Well, XESAM is here, and Strigi has to be 100% XESAM compatible. Jos implemented the new D-Bus XESAM interface, Flavio added the new query parser, and Phreedom is hacking on an RDF parser for the new ontology database. Add new user query support and you get a completely XESAM-compatible Strigi (cross-)desktop search engine.

4 comments:

Anonymous said...

I may have read your RTLD_GLOBAL statement wrong, but are you saying that you *shouldn't* use the GLOBAL flag, i.e. use RTLD_LOCAL?

I'd recommend using RTLD_GLOBAL. Although using RTLD_LOCAL may seem tempting, it causes numerous problems with C++. Exceptions, dynamic casts, etc. which attempt to compare object types may fail to find a match when they should, because RTLD_LOCAL can cause a library to be mapped into the address space multiple times. This breaks type comparisons that rely on address comparisons.

Also, singletons (which are evil anyway) within libraries loaded with RTLD_LOCAL will be instantiated for each dlopen(). This may not be what you want.

Using RTLD_GLOBAL avoids these problems and removes ambiguity from the address space.

I'd stay clear of RTLD_LOCAL, especially with C++ exceptions, RTTI and singletons.

Alexander Goncearenco said...

pfee, I was saying that you shouldn't use GLOBAL to load Strigi analyzer plugins. Because of the way Strigi plugins export analyzer factories, the factories can get messed up if you use GLOBAL.

I *am using* RTLD_GLOBAL to load libOpenBabel from the analyzer at runtime. It won't work with LOCAL. To make it clear, the call chain is:

Strigi
-(dlopen LOCAL)->
Plugin with analyzer
-(instantiates)->
a singleton helper
-(dlopen GLOBAL)->
libOpenBabel

Thanks for the hint, pfee. I'm not a C++ expert; I will bring Jos's attention to this issue once again.

Egon Willighagen said...

Alexander, how many things do you have uncommitted? I tried searching for InChI the other day [1], but it failed to find the molecules, even though the InChI was exactly what was listed by listFields.

1. http://kemistry-desktop.blogspot.com/2007/08/strigi-now-understands-xesam-queries.html

Alexander Goncearenco said...

Egon, I have quite a lot of things uncommitted atm; I am stabilizing the changes I introduced.

This should not affect your ability to search by InChIs. I have not looked at the new XESAM query parser yet, but I bet there is no support for the "tokenization" modifiers yet. I added this support to the previous query parser, but not to the new one.

As I mentioned in the section on tokenizers above, there needs to be support on both sides, indexers and query parsers, and the fieldproperty attributes should be loaded from the database. I think the extended attributes are simply ignored by the current version of the XESAM query parser.

I will fix that very soon.