Creating a Search Index with Zend_Search_Lucene

October 27, 2010

I suspect people commonly underestimate what’s required of a good Web site search engine. Some developers probably think that you just create a search box and then use the supplied terms in a database query to get the results. But there are actually three aspects to a search engine:

  • The index of the content to be searched
  • The act of actually performing the search
  • The reporting of the search results

Many people, I believe, think only about the last two, but the index is the key to the success of any search engine, just as a good index at the back of a book lets a reader quickly find what they’re looking for.

As far as search engines go, the gold standard is Apache Lucene. Lucene has been a reliable and popular search engine of choice for years now. Although Lucene is written in Java, the Zend_Search_Lucene module, part of the Zend Framework, is a great PHP port of the software. In a previous post, I explained how to integrate Zend_Search_Lucene into a Yii-based Web application. The focus in that post is really on getting the two different frameworks to work together. This is easily accomplished for two reasons:

  1. Yii supports third-party tools nicely
  2. The Zend Framework can be used piecemeal

So that previous post on Yii and Zend_Search_Lucene walks you through the Yii Controller and View files you’d create to perform a search and report upon the results, something that Zend_Search_Lucene does easily. Creating the index itself is the actual challenge, then, and one that I don’t feel is adequately documented elsewhere. In this post, I explain how to use Zend_Search_Lucene to create a search index of a site.

There are different ways you can create an index of any Web site. One option is to use a spider: provide it with a starting URL and it will then crawl through the site, indexing content, noting and following links, etc. Unfortunately (and this is something that’s not obvious to those new to Lucene), Lucene is not a spider and cannot crawl your site (the same goes for Zend_Search_Lucene). Lucene is simply a search engine. One option, which I have played around with, is to use a separate crawler that will create a Lucene-compatible index for you. The most popular such tool is perhaps Nutch, also from Apache. Nutch is an excellent crawler, but it’s written in Java and getting it to output Lucene-compatible indexes isn’t easy. I worked on that for some time and often felt like I was spinning my wheels (in part because I have no formal Java training). In the end, I decided to go another route…

Running a spider through a site (over HTTP) makes sense particularly when no internal access to the content is available. However, with a database-driven site, some of the site’s content can be discovered in another, easier way: in the database itself! In fact, unless a site has static pages, probably all of the site’s content can be found in the database. From database queries alone, you can generate a custom index. I’ll explain how to do that, but stepping back a bit…

Lucene stores information about content in a series of files on the server. In my Yii Controller, say SearchController.php, I identify the location of the index files as a private variable, so that the location will be available in multiple methods:

private $_indexFile = '/runtime/search';

Yii’s runtime directory is where the Yii application stores information as the site runs, such as logs, so it’s a logical place to put the index files (i.e., the application already writes data there).
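One thing to watch: a literal path that begins with a slash is resolved from the filesystem root, not from the application directory, so you may need to adapt it to your server. A minimal sketch of one alternative is to build the location from Yii's own runtime path. It assumes Yii 1.x, where Yii::app()->getRuntimePath() returns the application's runtime directory; the helper name searchIndexPath() is hypothetical:

```php
<?php
// Hypothetical helper: builds the index location under Yii's runtime directory.
// Assumes Yii 1.x is already loaded; getRuntimePath() is a CApplication method.
function searchIndexPath($subdirectory = 'search')
{
    // Trim any trailing separator so the result never contains a doubled slash.
    $runtime = rtrim(Yii::app()->getRuntimePath(), '/\\');

    return $runtime . DIRECTORY_SEPARATOR . $subdirectory;
}
```

Because a property declaration cannot call a function, the controller would assign this value in a method such as init() rather than in the declaration itself.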

Whether you’re creating an index for the first time or using it to execute a search query, you must create an object of type Zend_Search_Lucene:

$index = new Zend_Search_Lucene($this->_indexFile, true);

The first argument is the location on the server where the files should be stored. The second argument, true, says that a new index should be created. Naturally, you wouldn’t pass true when opening the index to perform a search, because you want the existing index, not a fresh one. Note that this code does assume you’re using the Zend Framework in some capacity and have already included the appropriate framework files. Also, this code was written using Zend Framework version 1.9.
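Zend Framework 1.x also provides the static methods Zend_Search_Lucene::create() and Zend_Search_Lucene::open(), which make the intent explicit. The sketch below tries to open an existing index and only creates one if opening fails. It assumes the framework classes are already loaded, and openOrCreateIndex() is a hypothetical helper name, not part of the framework:

```php
<?php
// Hypothetical helper; assumes Zend Framework 1.x is loaded or autoloadable.
function openOrCreateIndex($path)
{
    try {
        // Open the existing index, keeping everything already indexed.
        return Zend_Search_Lucene::open($path);
    } catch (Zend_Search_Lucene_Exception $e) {
        // Opening failed (typically because no index exists there yet),
        // so start a new, empty index at that location.
        return Zend_Search_Lucene::create($path);
    }
}
```

A search action would call only Zend_Search_Lucene::open(), while a full re-index would call Zend_Search_Lucene::create() to start over.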

Once the index directory has been identified, the next step is to add documents to the index, which is the important part. One option is to use the Zend_Search_Lucene_Document_Html class. Call the class’s loadHTMLFile() method, passing it the URL of the page to load:

$url = 'http://www.example.com/somepage.php';
$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($url);

Then you can add that document to the index:

$index->addDocument($doc);

These simple steps will work, but this requires that you know what every URL to be indexed is (again, getting back to the issue of creating a spider or crawler). Also, this approach requires the loading of the full HTML page for each document being added, meaning that the indexer has to read through all the extraneous HTML, CSS, and JavaScript, then find only the content that counts. Instead, you can get the content that counts directly from the database.

Tip: You can create your own spider, if you want, as the Zend_Search_Lucene_Document_Html object has a getLinks() method that returns every link found within the HTML document.
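To give a sense of what such a spider involves, here is a rough sketch of the crawl loop only. The function crawlSite() and its parameters are hypothetical; the page loading is passed in as a callback so that the loop itself does not depend on the framework. The sketch follows only absolute links on the starting host and skips relative links, which a real spider would need to resolve:

```php
<?php
// Hypothetical breadth-first crawl loop. $fetchLinks is any callable that
// takes a URL and returns an array of the links found on that page.
function crawlSite($startUrl, $fetchLinks, $maxPages = 100)
{
    $host = parse_url($startUrl, PHP_URL_HOST);
    $queue = array($startUrl);
    $seen = array($startUrl => true);
    $visited = array();

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $visited[] = $url;

        foreach (call_user_func($fetchLinks, $url) as $link) {
            // Stay on the same host and never queue a URL twice.
            // Relative links have no host and are skipped in this sketch.
            if (parse_url($link, PHP_URL_HOST) !== $host || isset($seen[$link])) {
                continue;
            }
            $seen[$link] = true;
            $queue[] = $link;
        }
    }

    return $visited;
}

// With Zend_Search_Lucene, the callback might look like this (untested sketch):
// $fetchLinks = function ($url) use ($index) {
//     $doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($url);
//     $index->addDocument($doc);
//     return $doc->getLinks();
// };
```

The $maxPages cap is there so that a mistake in the link filtering cannot send the loop through an unbounded number of pages.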

On a recent project I did, for each page in the site, the page-specific content was stored as a record in a pages table. When the user views the page in the browser, the page-specific content is retrieved and displayed within a context (the overall site template). Nothing else in that template needs to be indexed as it is all common to every page, so I can just index the data stored in the database instead. To do that, I first use the Yii Model to fetch every database record:

$model = Pages::model()->findAll();

Next, loop through each returned record:

foreach ($model as $m) {

Within the loop, each record needs to be added to the index as a new document. And here’s where you need to think about how the search and the search results should work. The two basic questions are:

  • What content should be searchable?
  • What content should be displayed on the search results?

In my example, the searchable content should be the page-specific content and its title. The search results should show the page title and a preview of its content, and link to the page itself. With that in mind, here’s how my loop indexed each returned record:

$url = 'http://www.example.com/index.php/pages/show/id/' . $m->id;
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('link', $url));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $m->pageTitle));
$content = $this->clean_content($m->content);
$doc->addField(Zend_Search_Lucene_Field::Unstored('content', $content));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('preview', $this->preview_content($content)));
$index->addDocument($doc);

First, I create the URL so that the search results will be able to link to the page. The URL for these particular records will be pages/show/id/X: show the Page whose ID value is X. Next, create a new Zend_Search_Lucene_Document object, which is the most generic search document class. From there, you add “fields” to the document. Each field can be of a specific type, can be given a unique name, and can be assigned a value.

Tip: The Zend_Search_Lucene_Document_Html class is just a Zend_Search_Lucene_Document with pre-defined fields.

The field types are all identified using Zend_Search_Lucene_Field::Something. They’re all listed in the Zend documentation and you’ll want to look those up before you get too involved with implementing this yourself. There are several important properties about the different field types. In particular, you want to pay attention to whether the field will be indexed or not and stored or not. For example, there’s no need to index the URL, because it won’t be meaningful as a search item, but it must be stored (to appear in the search results), so that field is of type UnIndexed. Conversely, the page’s title needs to be both stored and indexed, so it’s of type Text. The page’s content must be indexed but shouldn’t be stored (because that’d be inefficient), so it uses a field type of Unstored. Finally, I want to store but not index a preview of the document, so I use the UnIndexed type there again.
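As a quick reference, the properties of the five field types, as the Zend Framework 1.x manual lists them, can be written down as a lookup table. This is only a sketch: the function names fieldTypeProperties() and chooseFieldType() are hypothetical, and the "tokenized" column (whether the value is split into separate search terms) comes from the manual rather than from the discussion above. The manual spells the type UnStored; PHP method names are case-insensitive, so Unstored calls the same method.

```php
<?php
// Stored / indexed / tokenized properties of each field type,
// per the Zend Framework 1.x manual. Function names are hypothetical.
function fieldTypeProperties()
{
    return array(
        'Keyword'   => array('stored' => true,  'indexed' => true,  'tokenized' => false),
        'UnIndexed' => array('stored' => true,  'indexed' => false, 'tokenized' => false),
        'Binary'    => array('stored' => true,  'indexed' => false, 'tokenized' => false),
        'Text'      => array('stored' => true,  'indexed' => true,  'tokenized' => true),
        'UnStored'  => array('stored' => false, 'indexed' => true,  'tokenized' => true),
    );
}

// Returns the name of a (non-binary) field type matching the given needs,
// or null when no type fits (for example, neither stored nor indexed).
function chooseFieldType($stored, $indexed, $tokenized = true)
{
    foreach (fieldTypeProperties() as $type => $p) {
        if ($type === 'Binary') {
            continue; // Binary is for raw binary data, not text values.
        }
        // Tokenization only matters when the field is indexed at all.
        $tokenMatch = !$indexed || $p['tokenized'] === $tokenized;
        if ($p['stored'] === $stored && $p['indexed'] === $indexed && $tokenMatch) {
            return $type;
        }
    }

    return null;
}
```

For instance, a value that must be stored and matched exactly, such as an ID, maps to Keyword, while the link and preview in the loop above map to UnIndexed.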

Again, this is where you’re actually creating the custom index: you decide what should be indexed and stored, in other words, what’s important. You might also store, for example, the modification date of the content (but not index the date). You’ll also see in that code that I’m running the content through two user-defined functions: clean_content() and preview_content(). Here is how those are defined:

// Function for returning a preview of the content:
// The preview is the first $limit characters, followed by an ellipsis.
private function preview_content($data, $limit = 400) {
    return substr($data, 0, $limit) . '...';
} // End of preview_content() function.

// Function for stripping junk (HTML and PHP tags) out of content:
private function clean_content($data) {
    return strip_tags($data);
} // End of clean_content() function.

The clean_content() function strips out all the tags. This allows me to index just the text itself, without any HTML. Although defining a function that just invokes a single PHP function is generally a no-no, I went ahead and did that here in case I later decide to change how content is “cleaned”. The second function returns a subsection of the content, to be used as a preview. The preview will appear in the search results and is, by default, the first 400 characters of the content followed by an ellipsis.

Once a new document has been created, the last line (in the loop above) adds the document to the index.
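If you do decide to change how content is cleaned, here is one possible direction, offered as a sketch rather than as the code used on the project. The standalone functions cleanContent() and previewContent() are hypothetical variants: the first also decodes HTML entities and keeps text from adjacent elements apart, and the second cuts at a word boundary and skips the ellipsis when the content is already short. Both assume the content is valid UTF-8.

```php
<?php
// Hypothetical variant of clean_content(). Assumes valid UTF-8 input.
function cleanContent($html)
{
    // Put a space before every tag so that "</p><p>" does not glue two words
    // together. Side effect: inline tags inside a word also split it.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Decode entities after stripping, so "&lt;b&gt;" survives as literal text.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse the runs of whitespace left behind by the removed markup.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// Hypothetical variant of preview_content(). Counts characters, not bytes.
function previewContent($text, $limit = 400)
{
    $limit = (int) $limit;
    $text = trim(preg_replace('/\s+/u', ' ', $text));

    // Short content is returned whole, with no ellipsis.
    if (preg_match('/^.{0,' . $limit . '}$/us', $text)) {
        return $text;
    }
    // Longest prefix of at most $limit characters that ends before whitespace.
    if (preg_match('/^(.{1,' . $limit . '})(?=\s)/us', $text, $m)) {
        return $m[1] . '...';
    }
    // No whitespace within the limit (one very long word): cut it hard.
    preg_match('/^.{0,' . $limit . '}/us', $text, $m);
    return $m[0] . '...';
}
```

The regular expressions use the u modifier so that a multi-byte character is never cut in half, which substr() can do because it counts bytes.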

After all of the content has been indexed (and you’ve completed the foreach loop), you can optimize and commit the index:

$index->optimize();
$index->commit();

And that’s it! A custom index has now been created based upon data stored in the database. Is it a perfect replication of Google? No. But it works well as a custom index for a Web site. Once you understand how to create your own index, you can expand it, of course. In the site I worked on, there was a contacts table that stored lists of people (displayed on a single page in the site). To index that content, I retrieved all the contacts, and looped through them. For the title value, I used each person’s full name. For the content to be indexed, I concatenated several pieces of data about each contact: their name, title, organization, personal description, and so forth.
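The contacts case can be sketched along the same lines. The helper below only prepares the values; the property names (fullName, jobTitle, organization, description), the function name contactToFields(), and the Contacts model in the usage comment are all assumptions made for illustration, not the actual schema from the project:

```php
<?php
// Hypothetical helper: maps one contact record to the values to be indexed.
// Property names are assumed for illustration only.
function contactToFields($contact, $pageUrl)
{
    $parts = array();
    foreach (array('fullName', 'jobTitle', 'organization', 'description') as $property) {
        $value = isset($contact->$property) ? trim((string) $contact->$property) : '';
        if ($value !== '') {
            $parts[] = $value; // Skip empty pieces so no stray spaces appear.
        }
    }

    return array(
        'link'    => $pageUrl,                 // Every contact links to the one contacts page.
        'title'   => (string) $contact->fullName,
        'content' => implode(' ', $parts),     // Everything searchable, concatenated.
    );
}

// Usage inside the indexing method might look like this (untested sketch):
// foreach (Contacts::model()->findAll() as $c) {
//     $f = contactToFields($c, $contactsPageUrl);
//     $doc = new Zend_Search_Lucene_Document();
//     $doc->addField(Zend_Search_Lucene_Field::UnIndexed('link', $f['link']));
//     $doc->addField(Zend_Search_Lucene_Field::Text('title', $f['title']));
//     $doc->addField(Zend_Search_Lucene_Field::Unstored('content', $f['content']));
//     $index->addDocument($doc);
// }
```

Keeping the value preparation separate from the Zend calls makes it easy to check what will be indexed without building an index at all.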

In this particular project, I also indexed PDFs stored on the server, a topic I’ll write up in a separate post.

I hope this was helpful. Thanks for reading and, as always, let me know if you have any questions or comments!