Full Text Indexing: Status Update

Blog post by GeneralMaximus on Tue, 2009-06-30 13:18

After more than a week of thinking, "Today is the day I'll write that blog post", here I am with a status update on my HCD2009 project. I have only a few more points to add to what Matt has already posted here.

First of all, the previously unnamed full text indexing and search tool now has a name: Beacon. The indexing daemon currently in the works is called beacond. This is what beacond can do right now:

  • Monitor files for changes and add new/modified files to the index. Only plain text files are supported for now.
  • Handle mounting/unmounting of BFS volumes. Start watching volumes when they are mounted, and stop watching them when they are unmounted.
  • Selectively exclude certain folders from being indexed.
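
The folder exclusion in the last point can be sketched as a prefix check on whole path components. This is a minimal illustration in standard C++, not beacond's actual code: the function name `IsExcluded` and the example paths are hypothetical, and it assumes absolute, already-normalized paths.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not taken from beacond): returns true when `path`
// is one of the excluded folders or lies somewhere beneath one of them.
bool IsExcluded(const std::string& path,
                const std::vector<std::string>& excluded_dirs) {
    // Assumption: callers pass absolute paths.
    assert(!path.empty() && path[0] == '/');

    for (const std::string& dir : excluded_dirs) {
        if (dir.empty()) continue;

        // Drop trailing slashes so "/a/b/" and "/a/b" behave the same.
        std::string prefix = dir;
        while (prefix.size() > 1 && prefix.back() == '/') prefix.pop_back();

        // The path must start with the excluded folder...
        if (path.compare(0, prefix.size(), prefix) != 0) continue;

        // ...and the match must end on a component boundary, so that
        // "/a/b" does not accidentally exclude "/a/bc".
        if (prefix == "/" || path.size() == prefix.size() ||
            path[prefix.size()] == '/') {
            return true;
        }
    }
    return false;
}
```

A watcher would call such a check before queuing a changed file for indexing, so excluded folders never reach the index at all.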

Right now, I'm mostly concerned with polishing beacond. A few short term goals are:

  • Reduce memory usage. Currently, beacond eats up about 60MB of memory, which is way too much for what it does.
  • Perform the actual indexing operation in a separate thread. This is required so that the daemon does not become unresponsive during long indexing operations.
  • Write a small tool which can search the index created by beacond (for demonstration and testing purposes only).
  • Several minor tweaks (properly saving/loading settings, better build system etc.).
  • Write a few DataTranslators so that beacond can be tested with different kinds of files. PDF is top priority.
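
The second goal, indexing in a separate thread, can be sketched as a small work queue. This is only an illustration in standard C++ under stated assumptions: the class name `IndexWorker` and its callback are hypothetical, and a real Haiku daemon might use the native threading or looper APIs instead of `std::thread`.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical helper (not beacond's design): the caller enqueues paths
// and returns immediately, while a background thread does the slow
// indexing work one file at a time.
class IndexWorker {
public:
    using IndexFn = std::function<void(const std::string&)>;

    explicit IndexWorker(IndexFn index_file)
        : index_file_(std::move(index_file)),
          thread_([this] { Run(); }) {
        assert(index_file_);  // an empty callback would be a caller bug
    }

    // Finishes any queued paths, then stops the background thread.
    ~IndexWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_one();
        thread_.join();
    }

    IndexWorker(const IndexWorker&) = delete;
    IndexWorker& operator=(const IndexWorker&) = delete;

    // Cheap and non-blocking from the caller's point of view.
    void Enqueue(std::string path) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push(std::move(path));
        }
        ready_.notify_one();
    }

private:
    void Run() {
        for (;;) {
            std::string path;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock,
                            [this] { return done_ || !pending_.empty(); });
                if (pending_.empty()) return;  // done_ is set, queue drained
                path = std::move(pending_.front());
                pending_.pop();
            }
            index_file_(path);  // slow work runs outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> pending_;
    bool done_ = false;
    IndexFn index_file_;
    std::thread thread_;  // declared last so it starts after the rest exist
};
```

With this shape, the code that reacts to file change notifications only calls `Enqueue`, so a long indexing run cannot stall it.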

In the long run, my major goals will be (1) seamlessly integrating Beacon with the existing Find tool in Haiku and (2) supporting more file types. But for now, the focus is on getting the daemon right.

If anybody wishes to check Beacon out, here is the project homepage (hosted on Google Code).

Comments

Re: Full Text Indexing: Status Update

A few hours after this update I finally managed to reduce memory usage from 60MB to around 7MB :)

The problem was that beacond was reading entire text files into memory (Rene had already pointed this out a few days back). It took me some time to figure out how to pass a stream to CLucene instead of the entire file contents.
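
To illustrate why streaming helps, here is a minimal sketch in standard C++ that tokenizes a file through a fixed-size buffer, so memory use depends on the chunk size and the longest word rather than on the file size. It does not use or reproduce the CLucene API: the function `StreamTokens`, its callback, and the simple alphanumeric tokenizer are assumptions made for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>  // only needed when feeding the function from a string
#include <string>
#include <vector>

// Hypothetical helper (not beacond or CLucene code): reads `in` in chunks
// of `chunk_size` bytes and calls `on_token` for each lowercase
// alphanumeric word. Returns the number of tokens produced.
std::size_t StreamTokens(
    std::istream& in, std::size_t chunk_size,
    const std::function<void(const std::string&)>& on_token) {
    assert(chunk_size > 0);

    std::vector<char> buffer(chunk_size);
    std::string token;  // carries a partial word across chunk boundaries
    std::size_t count = 0;

    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();  // may be short at EOF

        for (std::streamsize i = 0; i < got; ++i) {
            const unsigned char c = static_cast<unsigned char>(buffer[i]);
            if (std::isalnum(c)) {
                token.push_back(static_cast<char>(std::tolower(c)));
            } else if (!token.empty()) {
                on_token(token);
                ++count;
                token.clear();
            }
        }
    }

    // Flush a word that ran up to the end of the input.
    if (!token.empty()) {
        on_token(token);
        ++count;
    }
    return count;
}
```

Passing an `std::ifstream` opened on the file gives the same result as loading the whole file first, but only one chunk is resident at a time.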

Re: Full Text Indexing: Status Update

Sounds like very good progress here!
By DataTranslators, do you mean special CLucene ones or the BeOS-style system-wide translators? I think the text indexing and search tool especially should integrate with the system very well and take advantage of the BFS and translator functionality as much as possible.
Good luck with your project!

Re: Full Text Indexing: Status Update

I mean the Haiku DataTranslators. Writing those would mean all applications benefit :)

Re: Full Text Indexing: Status Update

General Maximus wrote:

I mean the Haiku DataTranslators. Writing those would mean all applications benefit :)

I suppose for the purposes of Beacon these translators would just need to extract the text from PDFs and other files. That should not be too difficult, and should still be generally useful to other applications. I am not sure if you are familiar with the scripting language Ruby, but there is a pretty nice PDF library called Prawn:

http://github.com/sandal/prawn/tree/master

If you can read Ruby, this might be an easier way to see just how to extract text. Of course, there may already be plenty of C or C++ libraries you could use. But since this is such a simple task maybe it would be better just to write dead-simple and fast code just to extract the text.

That's what I would investigate first.

Re: Full Text Indexing: Status Update

Yes, I've played with Ruby for a while (although I mostly use Python for those quick, late-night scripts that make you hate yourself the next morning).

I'll take a look at that code when I get to the DataTranslators.

Ryan Leavengood wrote:

But since this is such a simple task maybe it would be better just to write dead-simple and fast code just to extract the text.

You're right. I only need to extract plain text from PDF files. A full-blown PDF library might be overkill.