Creating a Search Index with Zend_Search_Lucene

I suspect people commonly underestimate what’s required of a good Web site search engine. Some developers probably think that you just create a search box and then use the supplied terms in a database query to get the results. But there are actually three aspects to a search engine:

The index of the content to be searched
The act of actually performing the search
The reporting of the search results

Many people, I believe, think only about the last two, but the index is the key to the success of any search engine, just as a good index at the back of a book lets a reader quickly find what they’re looking for.

In \[intlink id="790" type="post"\]a previous post\[/intlink\], I explained how to integrate Zend_Search_Lucene into a Yii-based Web application. The focus in that post is really on getting the two different frameworks to work together. This is easily accomplished for two reasons:

Yii supports third-party tools nicely
The Zend Framework can be used piecemeal

So that previous post on Yii and Zend_Search_Lucene walks you through the Yii Controller and View files you’d create to perform a search and report upon the results, something that Zend_Search_Lucene does easily. Creating the index itself is the actual challenge, then, and one that I don’t feel is adequately documented elsewhere. In this post, I explain how to use Zend_Search_Lucene to create a search index of a site.

There are different ways you can create an index of any Web site. One option is to use a spider: provide it with a starting URL and it will then crawl through the site, indexing content, noting and following links, etc. Unfortunately, and this is something that’s not obvious to those new to Lucene, Lucene is not a spider and cannot crawl your site (the same goes for Zend_Search_Lucene). Lucene is simply a search engine. One option, which I have played around with, is to use a separate crawler that will create a Lucene-compatible index for you. The most popular such tool is perhaps Nutch, also from Apache. Nutch is an excellent crawler, but it’s written in Java and getting it to output Lucene-compatible indexes isn’t easy. I worked on that for some time and often felt like I was spinning my wheels (in part because I have no formal Java training). In the end, I decided to go another route…

Running a spider through a site (over HTTP) makes sense particularly when no internal access to the content is available. However, with a database-driven site, some of the site’s content can be discovered in another, easier way: in the database itself! In fact, unless a site has static pages, probably all of the site’s content can be found in the database. From database queries alone, you can generate a custom index. I’ll explain how to do that, but stepping back a bit…

\[php\]\[/php\]

Yii’s runtime directory is where the Yii application stores information as the site runs, such as logs, so it’s a logical place to put the index files (i.e., the application already writes data there).

\[php\]\[/php\]

The first argument is the location on the server where the files should be stored. The second argument of true says that a new index should be created. You wouldn’t use this argument to perform a search, naturally. Note that this code does assume you’re using the Zend Framework in some capacity and have already included the appropriate framework files. Also, this code was written using Zend Framework version 1.9.
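For illustration, a minimal sketch of this step might look like the following. It is not necessarily the code used on the site: both helper names and the `search` subdirectory are assumptions made for the example, and it presumes the Zend Framework 1.x classes are already loaded.

```php
<?php
// Hypothetical helper: builds the index location inside Yii's runtime
// directory. The "search" subdirectory name is an assumption.
function search_index_path($runtimePath)
{
    return rtrim($runtimePath, '/\\') . DIRECTORY_SEPARATOR . 'search';
}

// Hypothetical helper: creates an index at that location.
// Assumes Zend_Search_Lucene (Zend Framework 1.x) has already been included.
function create_search_index($runtimePath)
{
    // Second argument true = create a new index instead of opening one.
    return new Zend_Search_Lucene(search_index_path($runtimePath), true);
}

// In a Yii 1.x application this might be called as:
// $index = create_search_index(Yii::app()->runtimePath);
```

Keeping the path logic in its own function means the search code can reuse it to open the same index later, without the second argument.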

\[php\]\[/php\]\[php\]\[/php\]

These simple steps will work, but they require that you know every URL to be indexed (again, getting back to the issue of creating a spider or crawler). This approach also requires loading the full HTML page for each document being added, meaning that the indexer has to read through all the extraneous HTML, CSS, and JavaScript to find only the content that counts. Instead, you can get the content that counts directly from the database.

Tip: You can create your own spider, if you want, as the Zend_Search_Lucene_Document_Html object has a getLinks() method that returns every link found within the HTML document.
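As a rough idea of what one step of such a spider could look like, here is a sketch under a few assumptions: the two function names are hypothetical, the same-host filter is my own addition to keep a crawl on one site, and fetching a URL this way relies on PHP’s allow_url_fopen setting being enabled.

```php
<?php
// Hypothetical helper: keeps only absolute links that point at the given
// host. Relative links have no host and are dropped in this sketch.
function same_site_links(array $links, $host)
{
    $kept = array();
    foreach ($links as $link) {
        if (parse_url($link, PHP_URL_HOST) === $host) {
            $kept[] = $link;
        }
    }
    // Remove duplicates and reindex from zero.
    return array_values(array_unique($kept));
}

// Hypothetical helper: indexes one page over HTTP and returns the same-site
// links found on it. Assumes Zend Framework 1.x is loaded.
function index_html_page($index, $url)
{
    $doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($url);
    $index->addDocument($doc);
    return same_site_links($doc->getLinks(), parse_url($url, PHP_URL_HOST));
}
```

A real crawler would also need to resolve relative links and remember which URLs it has already visited, both of which are left out here.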

\[php\]\[/php\]\[php\]\[/php\]

Within the loop, each record needs to be added to the index as a new document. And here’s where you need to think about how the search and the search results should work. The two basic questions are:

What content should be searchable?
What content should be displayed in the search results?

\[php\]\[/php\]

First, I create the URL so that the search results will be able to link to the page. The URL for these particular records will be pages/show/id/X: show the Page whose ID value is X. Next, create a new Zend_Search_Lucene_Document object, which is the most generic search document class. From there, you add “fields” to the document. Each field can be of a specific type, can be given a unique name, and can be assigned a value.

Tip: The Zend_Search_Lucene_Document_Html class is just a Zend_Search_Lucene_Document with pre-defined fields.

The field types are all identified using Zend_Search_Lucene_Field::Something. They’re all listed in the Zend documentation and you’ll want to look those up before you get too involved with implementing this yourself. The field types differ in a few important properties. In particular, you want to pay attention to whether the field will be indexed or not and stored or not. For example, there’s no need to index the URL, because it won’t be meaningful as a search item, but it must be stored (to appear in the search results), so that field is of type UnIndexed. Conversely, the page’s title needs to be both stored and indexed, so it’s of type Text. The page’s content must be indexed but shouldn’t be stored (because that’d be inefficient), so it uses a field type of UnStored. Finally, I want to store but not index a preview of the document, so I use the UnIndexed type there again.
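To make the four field choices concrete, here is a hedged sketch of building one document. The function names and the field names (url, title, content, preview) are assumptions for illustration; the field factory methods themselves are the Zend Framework 1.x ones, and the framework is assumed to be loaded already.

```php
<?php
// The URL pattern comes from the text above: pages/show/id/X.
function page_url($id)
{
    return 'pages/show/id/' . (int) $id;
}

// Hypothetical helper: turns one page's values into a search document.
// $text is the tag-free content; $preview is the short excerpt.
function build_page_document($id, $title, $text, $preview)
{
    $doc = new Zend_Search_Lucene_Document();
    // Stored but not indexed: only needed to link from the results.
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', page_url($id)));
    // Stored and indexed: searchable and shown in the results.
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $title));
    // Indexed but not stored: searchable, never displayed.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $text));
    // Stored but not indexed: displayed, never searched.
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('preview', $preview));
    return $doc;
}
```

Inside the loop over the database records, the returned document would then be handed to the index with its addDocument() method.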

\[php\]\[/php\]

The clean_content() function strips out all the tags. This allows me to index just the text itself, without any HTML. Although defining a function that just invokes a single PHP function is generally a no-no, I went ahead and did that here in case I later decide to change how content is “cleaned”. The second function returns a subsection of the content, to be used as a preview. The preview will appear in the search results and is, by default, the first 400 characters of the content followed by an ellipsis. Once a new document has been created, the last line (in the loop above) adds the document to the index.
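A minimal sketch of two such functions follows. The clean_content() name comes from the description above; the name content_preview() and its signature are assumptions, as is the choice to add the ellipsis only when the text is actually cut short.

```php
<?php
// Thin wrapper around strip_tags(), so the cleaning rules can change later
// without touching the indexing code.
function clean_content($content)
{
    return strip_tags($content);
}

// Hypothetical name and signature: returns the first $length characters of
// the tag-free content, followed by "..." if anything was cut off.
function content_preview($content, $length = 400)
{
    $text = clean_content($content);
    if (strlen($text) <= $length) {
        return $text;
    }
    return substr($text, 0, $length) . '...';
}
```

Note that strlen() and substr() count bytes, so content with multibyte characters would be better served by the mb_strlen() and mb_substr() equivalents.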

\[php\]\[/php\]

And that’s it! A custom index has now been created based upon data stored in the database. Is it a perfect replication of Google? No. But it works well as a custom index for a Web site. Once you understand how to create your own index, you can expand it, of course. In the site I worked on, there was a contacts table that stored lists of people (displayed on a single page in the site). To index that content, I retrieved all the contacts, and looped through them. For the title value, I used each person’s full name. For the content to be indexed, I concatenated several pieces of data about each contact: their name, title, organization, personal description, and so forth.
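The concatenation step for contacts could be sketched as below. The function name and the array keys are assumptions, since the actual column names of the contacts table are not shown here.

```php
<?php
// Hypothetical helper: joins the non-empty pieces of one contact record into
// a single string suitable for an indexed, unstored content field.
function contact_index_text(array $contact)
{
    // Assumed keys; adjust to match the real table columns.
    $keys = array('name', 'title', 'organization', 'description');
    $parts = array();
    foreach ($keys as $key) {
        if (!empty($contact[$key])) {
            // Strip any markup and surrounding whitespace from each piece.
            $parts[] = trim(strip_tags($contact[$key]));
        }
    }
    return implode(' ', $parts);
}
```

The result would be used as the indexed content of the contact’s document, with the person’s full name as the title.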

In this particular project, I also indexed PDFs stored on the server, a topic I’ll write up in a separate post.

I hope this was helpful. Thanks for reading and, as always, let me know if you have any questions or comments!