August 10, 2007

Strigi-chemical analyzers inside

Ten things to know for a better understanding of Strigi analyzers:
  1. the only data source is an input stream (think jstreams, not std);
  2. stream providers take care of the embedded substreams: unzip, decode, deflate, etc.;
  3. every analyzer gets the stream and takes a bite (the bite is 1024 bytes at the moment);
  4. if it does not like the taste, the analyzer gives the stream back; e.g. a JPEG file does not sound nice to the MP3 analyzer, so it won't take another bite;
  5. analyzers work in parallel; the drawback is that one analyzer cannot use the analysis results of another;
  6. an analyzer parses the stream and decides which data to index as fields of the ontology;
  7. the ontology describes the (meta)data fields: all their attributes, flags and the hierarchy;
  8. indexwriters handle the data passed by the analyzers according to the ontology's field descriptions;
  9. the indexing process can be controlled by analyzer configurations, which can be represented as a config file;
  10. analyzers can be distributed as plugins and loaded/included/excluded at runtime.
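
Points 3 and 4 can be illustrated with a minimal sketch. It is written in Python for brevity (Strigi itself is C++), and the class and function names are hypothetical, not Strigi's API. Each analyzer looks only at the first 1024-byte bite, and the stream is rewound so it is handed back untouched:

```python
import io

BITE_SIZE = 1024  # size of the header "bite" mentioned in point 3


class Mp3LikeAnalyzer:
    """Hypothetical analyzer: accepts streams that start with an ID3v2 tag."""
    name = "mp3"

    def check_header(self, header: bytes) -> bool:
        return header.startswith(b"ID3")


class JpegLikeAnalyzer:
    """Hypothetical analyzer: accepts streams that start with the JPEG marker."""
    name = "jpeg"

    def check_header(self, header: bytes) -> bool:
        return header.startswith(b"\xff\xd8\xff")


def select_analyzers(stream, analyzers):
    """Read one bite, rewind, and return the names of analyzers that accept it."""
    start = stream.tell()
    header = stream.read(BITE_SIZE)
    stream.seek(start)  # give the stream back untouched
    return [a.name for a in analyzers if a.check_header(header)]


jpeg_stream = io.BytesIO(b"\xff\xd8\xff\xe0" + b"\x00" * 2000)
accepted = select_analyzers(jpeg_stream, [Mp3LikeAnalyzer(), JpegLikeAnalyzer()])
print(accepted)
```

Here the MP3-like analyzer declines the JPEG stream after the first bite, and only the JPEG-like one would go on reading.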
Two problems have to be taken into account:
  1. some greedy analyzers have a bad habit of reading every stream to the very end (e.g. cppanalyzer, yes, C++ source code), which can hurt performance;
  2. in some cases the extracted data is not enough, and additional metadata has to be calculated or generated. A very simple example would be counting the comments in a C++ source file; a more advanced example is generating an InChI identifier from the extracted chemical structure.
If a greedy analyzer also performs slow calculations, it will slow down the whole indexing procedure.

The obvious solution would be to make the analyzers more selective, and if possible -- not greedy. As for the slow analyzers, they should be optional and highly selective about what they process.

Black magic of MIME

Shared-mime-info is a specification, a standard for MIME description databases. Chemical-mime-data follows the same specification, but adds support for chemical formats. A MIME type is a content-based identifier of the stream type. Sometimes the MIME type is mistakenly assigned by file extension only. The rules in shared-mime-info and chemical-mime-data check not only the file extension, but also look at the contents of the stream. Telling the MIME type from a few specific bytes in a specific position range is considered black magic. Many formats, text formats in particular, cannot be identified easily with these magic rules.
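
A magic rule of this kind can be sketched roughly as follows. The rule table is an illustration using well-known signatures (PNG, PDF, tar), not an excerpt from either database, and `sniff_mime` is a hypothetical helper name:

```python
# Each rule: (first_offset, last_offset, pattern, mime_type).
# The pattern must start somewhere in the inclusive offset range.
MAGIC_RULES = [
    (0, 0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (0, 0, b"%PDF-", "application/pdf"),
    (257, 257, b"ustar", "application/x-tar"),
]


def sniff_mime(header: bytes):
    """Return the MIME type of the first matching rule, or None if nothing matches."""
    for first, last, pattern, mime in MAGIC_RULES:
        for offset in range(first, last + 1):
            if header[offset:offset + len(pattern)] == pattern:
                return mime
    return None


print(sniff_mime(b"%PDF-1.4\n"))
print(sniff_mime(b"just some plain text\n"))
```

The second call returns `None`: a plain-text stream carries no fixed signature, which is exactly why text formats are hard to identify this way.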

For some file formats, the header checks in analyzers perform exactly the same procedure as the MIME checks. For analyzers of this kind it would be a good idea to rely on the mimetype. But because you cannot use the result of one analyzer in another (see point 5 above), you cannot expect the mimetype(s) to be there when you start the analysis. Actually, there is a workaround, but it's tricky and potentially unstable.

For chemical formats, it is essential to know the exact MIME type of the stream. Chemical-mime-data can partly help here. That's why I took the code from the MIME type analyzer and made a helper out of it, just as in the external-library case.

Concerning the greediness of the analyzers, inverse logic could be applied. Jos has already introduced some optional constraints in analyzerconfiguration to limit the amount of data a certain analyzer can consume: no more than a given number of bytes. This could help. Analyzers that do not look for a fingerprint of the file format in the header could be guided by a negative-control rule: if a non-ASCII character is encountered -- stop; if an invalid UTF-8 sequence is found -- stop; i.e. if something unexpected is found, we stop processing.
Another technique, which is already employed, is to look for the necessary minimum of data and then stop the analysis. If all the fields we are looking for have already been found, why ask for more?
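
The byte limit and the negative-control rule can be combined in a few lines. This is a minimal sketch with a hypothetical function name and limits, not the analyzerconfiguration API. It covers only the ASCII case; a UTF-8 variant could use an incremental decoder instead of the byte test:

```python
import io


def read_while_ascii(stream, max_bytes=4096, chunk_size=1024):
    """Read until EOF, the byte limit, or the first non-ASCII byte.

    Returns (data, reason), where reason is 'eof', 'limit' or 'non-ascii'.
    A chunk containing a non-ASCII byte is discarded as a whole.
    """
    collected = bytearray()
    while len(collected) < max_bytes:
        chunk = stream.read(min(chunk_size, max_bytes - len(collected)))
        if not chunk:
            return bytes(collected), "eof"
        if any(byte > 0x7F for byte in chunk):
            return bytes(collected), "non-ascii"  # something unexpected: stop
        collected.extend(chunk)
    return bytes(collected), "limit"


print(read_while_ascii(io.BytesIO(b"C 0.0 0.0 0.0\n"))[1])
print(read_while_ascii(io.BytesIO(b"\xff\xd8\xff\xe0"))[1])
print(read_while_ascii(io.BytesIO(b"a" * 10000))[1])
```

The three calls stop for three different reasons: end of stream, an unexpected byte, and the configured limit.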


The helpers in strigi-chemical are shared libs with thread-safe wrappers around the MIME analyzer and libOpenBabel.

The typical workflow is:
  1. detect and check the MIME type;
  2. if it does not match -- stop;
  3. if the MIME type matches or could not be detected -- perform further analysis;
  4. look for recognizable data, but do not index it yet;
  5. if something strange is found, stop and discard the stream: we do not need false positives;
  6. if all the data has been collected -- stop;
  7. if something is missing and can be generated without using OpenBabel -- generate it;
  8. if something is missing and can be generated by OpenBabel -- call the OB helper;
  9. add the data to the index.
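
The workflow above can be sketched as a single function. Everything here is a hypothetical stand-in written in Python for brevity: the `chemical/x-toy` type, the `key=value` toy format and the generators are invented for illustration, and the "expensive" generator merely marks the place where an OpenBabel helper would be called:

```python
def run_workflow(data, expected_mime, detect_mime, parse, required,
                 cheap_generators, expensive_generators, index):
    """Return True if fields were indexed, False if the stream was rejected."""
    mime = detect_mime(data)                           # step 1
    if mime is not None and mime != expected_mime:
        return False                                   # step 2
    fields = parse(data)                               # steps 3-4: collect, do not index yet
    if fields is None:
        return False                                   # step 5: something strange, discard
    # Steps 6-8: generate only what is still missing, cheap generators first.
    for generators in (cheap_generators, expensive_generators):
        for name in required:
            if name not in fields and name in generators:
                fields[name] = generators[name](fields)
    index.append(fields)                               # step 9
    return True


def detect_mime(data):
    """Toy detector: knows one made-up type, otherwise cannot tell."""
    return "chemical/x-toy" if data.startswith(b"TOY\n") else None


def parse(data):
    """Toy parser for 'key=value' lines; returns None on anything unexpected."""
    try:
        lines = data.decode("ascii").splitlines()
    except UnicodeDecodeError:
        return None
    if lines and lines[0] == "TOY":
        lines = lines[1:]
    fields = {}
    for line in lines:
        if "=" not in line:
            return None
        key, value = line.split("=", 1)
        fields[key] = value
    return fields


cheap = {"atom_count": lambda f: str(len(f["atoms"].split()))}
expensive = {"identifier": lambda f: "generated-by-helper"}  # stand-in for the OB helper

index = []
ok = run_workflow(b"TOY\natoms=C H H H H\n", "chemical/x-toy", detect_mime, parse,
                  ["atoms", "atom_count", "identifier"], cheap, expensive, index)
print(ok, index)
```

Nothing reaches the index until every check has passed, so a stream rejected at step 2 or step 5 leaves no partial, possibly false-positive data behind.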


Stefan said...

Maybe it would be a good idea to generate a simple dependency graph of analyzers, instead of enforcing that all of them run in parallel. This could be done very simply by each analyzer carrying a "depends-on" flag. Most analyzers depend on nothing so they can all be the root of their own tree. Then you can simply run all 'trees' in parallel.

I have worked on a file-indexing system similar to strigi before and this is in fact what we did to organize our analyzer plugins. It worked quite nicely. If interested, there's a write-up of that project here (starting at page 40).

Either way, keep up the good work!

Alexandr Goncearenco said...

Stefan, I would vote for a dependency graph of analyzers, but it seems that analyzers other than the chemical ones do not need these advanced features, and Jos is not happy about this idea. It would be no problem to have a graph of dependencies in strigi-chemical, although atm I don't think it will improve the analysis much.

I would rather stay with analyzers without any dependency hierarchies. They are easy to unit-test and to debug.

Lots of core functionality is missing and this bothers me more.

Thanks a lot for the paper on the semantic filesystem. Personally, I do not support this idea (or the proposed implementation). By projecting a slice of the semantic network onto the filesystem you still have all the limitations of the filesystem. It is not more useful than the "panelize search results" option in file managers.

An RCS-like interface to the VFS is nice, but just as a proof of principle. I would prefer a combination of semantics-aware Unix tools like find and locate, and network (not tree) management GUI tools integrated into the desktop environment.