<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
xmlns:series="http://unfoldingneurons.com/"
><channel><title>Larry Ullman &#187; search</title> <atom:link href="http://www.larryullman.com/tag/search/feed/" rel="self" type="application/rss+xml" /><link>http://www.larryullman.com</link> <description>Translating Geek Into English</description> <lastBuildDate>Mon, 21 May 2012 11:03:07 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <item><title>Creating a Search Index with Zend_Search_Lucene</title><link>http://www.larryullman.com/2010/10/27/creating-a-search-index-with-zend_search_lucene/</link> <comments>http://www.larryullman.com/2010/10/27/creating-a-search-index-with-zend_search_lucene/#comments</comments> <pubDate>Wed, 27 Oct 2010 09:00:22 +0000</pubDate> <dc:creator>Larry</dc:creator> <category><![CDATA[MySQL]]></category> <category><![CDATA[PHP]]></category> <category><![CDATA[Web Development]]></category> <category><![CDATA[lucene]]></category> <category><![CDATA[search]]></category> <category><![CDATA[zend]]></category><guid
isPermaLink="false">http://www.larryullman.com/?p=807</guid> <description><![CDATA[I suspect people commonly underestimate what&#8217;s required of a good Web site search engine. Some developers probably think that you just create a search box and then use the supplied terms in a database query to get the results. But there are actually three aspects to a search engine: The index of the content to [...]]]></description> <content:encoded><![CDATA[<p>I suspect people commonly underestimate what&#8217;s required of a good Web site search engine. Some developers probably think that you just create a search box and then use the supplied terms in a database query to get the results. But there are actually three aspects to a search engine:</p><ul><li>The <em>index</em> of the content to be searched</li><li>The act of actually <em>performing the search</em></li><li>The <em>reporting</em> of the search results</li></ul><p>Many people, I believe, only really think about these last two, but it&#8217;s really the index that&#8217;s key to the success of any search engine, just like a good index at the back of a book makes it possible for a reader to quickly find what they&#8217;re looking for.</p><p>As far as search engines go, the gold standard is <a
href="http://lucene.apache.org">Apache Lucene</a>. Lucene has been a reliable and popular search engine of choice for years now. Although Lucene is written in Java, the <a
href="http://framework.zend.com/manual/en/zend.search.lucene.html">Zend_Search_Lucene</a> module, part of the <a
href="http://framework.zend.com">Zend Framework</a>, is a great PHP port of the software. <a
href="http://www.larryullman.com/2009/12/05/integrating-zend_lucene-with-yii/">In a previous post</a>, I explained how to integrate <strong>Zend_Search_Lucene</strong> into a <a
href="http://www.yiiframework.com">Yii</a>-based  Web application. The focus in that post is really on getting the two  different frameworks to work together. This is easily accomplished for  two reasons:</p><ol><li> Yii supports third-party tools nicely</li><li>The Zend Framework can be used piecemeal</li></ol><p>So that previous post on Yii and <strong>Zend_Search_Lucene</strong> walks you through the Yii Controller and View files you&#8217;d create to perform a search and report upon the results, something that <strong>Zend_Search_Lucene</strong> does easily. Creating the  index itself is the actual challenge, then, and one that I don&#8217;t feel  is adequately documented elsewhere. In this post, I explain how to use <strong>Zend_Search_Lucene</strong> to create a search index of  a site.</p><p><span
id="more-807"></span></p><p>There are different ways you can create an index of any Web site. One option is to use a spider: provide it with a starting URL and it will then crawl through the site, indexing content, noting and following links, etc. Unfortunately, and this is something that&#8217;s not obvious to those new to Lucene, <span
style="text-decoration: underline;">Lucene is not a spider and cannot crawl your site</span> (the same goes for <strong>Zend_Search_Lucene</strong>). Lucene is simply a search engine. One option, which I have played around with, is to use a separate crawler that will create a Lucene-compatible index for you. The most popular such tool is perhaps <a
href="http://nutch.apache.org/">Nutch</a>, also from Apache. Nutch is an excellent crawler, but it&#8217;s written in Java and getting it to output Lucene-compatible indexes isn&#8217;t easy. I worked on that for some time and often felt like I was spinning my wheels (in part because I have no formal Java training). In the end, I decided to go another route&#8230;</p><p>Running a spider through a site (over HTTP) makes sense particularly when no internal access to the content is available. However, with a database-driven site, some of the site&#8217;s content can be discovered in another, easier way: in the database itself! In fact, unless a site has static pages, probably all of the site&#8217;s content can be found in the database. From database queries alone, you can generate a custom index. I&#8217;ll explain how to do that, but stepping back a bit&#8230;</p><p>Lucene stores information about content in a series of files on the server. In my Yii Controller, say <strong>SearchController.php</strong>, I identify the location of the index files as a private variable, so that the location will be available in multiple methods:</p><pre class="brush: php; title: ; notranslate">private $_indexFile = '/runtime/search';</pre><p>Yii&#8217;s <strong>runtime</strong> directory is where the Yii application stores information as the site runs, such as logs, so it&#8217;s a logical place to put the index files (i.e., the application already writes data there).</p><p>Whether you&#8217;re creating an index for the first time or using it to execute a search query, you must create an object of type <strong>Zend_Search_Lucene</strong>:</p><pre class="brush: php; title: ; notranslate">$index = new Zend_Search_Lucene($this-&gt;_indexFile, true);</pre><p>The first argument is the location on the server where the files should be stored. The second argument of <strong>true</strong> says that a new index should be created. 
You wouldn&#8217;t use this argument to perform a search, naturally. Note that this code does assume you&#8217;re using the Zend Framework in some capacity and have already included the appropriate framework files. Also, this code was written using Zend Framework version 1.9.</p><p>Once the index directory has been identified, the next step is to add documents to the index, which is the important part. One option is to use the <strong>Zend_Search_Lucene_Document_Html</strong> class. Call the class&#8217;s <strong>loadHTMLFile()</strong> method, passing it the URL of the page to load:</p><pre class="brush: php; title: ; notranslate">$url = 'http://www.example.com/somepage.php';
$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($url);</pre><p>Then you can add that document to the index:</p><pre class="brush: php; title: ; notranslate">$index-&gt;addDocument($doc);</pre><p>These simple steps will work, but they require that you know every URL to be indexed (again, getting back to the issue of creating a spider or crawler). Also, this approach requires the loading of the full HTML page for each document being added, meaning that the indexer has to read through all the extraneous HTML, CSS, and JavaScript, then find only the content that counts. Instead, you can get the content that counts directly from the database.</p><blockquote><p>Tip: You can create your own spider, if you want, as the <strong>Zend_Search_Lucene_Document_Html</strong> object has a <strong>getLinks()</strong> method that returns every link found within the HTML document.</p></blockquote><p>On a recent project I did, for each page in the site, the page-specific content was stored as a record in a <strong>pages</strong> table. When the user views the page in the browser, the page-specific content is retrieved and displayed within a context (the overall site template). Nothing else in that template needs to be indexed as it is all common to every page, so I can just index the data stored in the database instead. To do that, I first use the Yii Model to fetch every database record:</p><pre class="brush: php; title: ; notranslate">$model = Pages::model()-&gt;findAll();</pre><p>Next, loop through each returned record:</p><pre class="brush: php; title: ; notranslate">foreach ($model as $m) {</pre><p>Within the loop, each record needs to be added to the index as a new document. And here&#8217;s where you need to think about how the search and the search results should work. 
The two basic questions are:</p><ul><li>What content should be search-able?</li><li>What content should be displayed on the search results?</li></ul><p>In my example, the search-able content should be the page-specific content and its title. The search results should show the page title, a preview of its content, and be linked to the page itself. With that in mind, here&#8217;s how my loop indexed each returned record:</p><pre class="brush: php; title: ; notranslate">$url = 'http://www.example.com/index.php/pages/show/id/' . $m-&gt;id;
$doc = new Zend_Search_Lucene_Document();
$doc-&gt;addField(Zend_Search_Lucene_Field::UnIndexed('link', $url));
$doc-&gt;addField(Zend_Search_Lucene_Field::Text('title', $m-&gt;pageTitle));
$content = $this-&gt;clean_content($m-&gt;content);
$doc-&gt;addField(Zend_Search_Lucene_Field::Unstored('content', $content));
$doc-&gt;addField(Zend_Search_Lucene_Field::UnIndexed('preview', $this-&gt;preview_content($content)));
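// Optional sketch, not part of the original code: also store (but don't index)
// the content's modification date so it can be shown in the search results.
// Assumption: the Pages model has a hypothetical "modified" attribute.
if (isset($m-&gt;modified)) {
    $doc-&gt;addField(Zend_Search_Lucene_Field::UnIndexed('modified', $m-&gt;modified));
}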
$index-&gt;addDocument($doc);</pre><p>First, I create the URL so that the search results will be able to link to the page. The URL for these particular records will be <span
style="text-decoration: underline;">pages/show/id/X</span>: show the <strong>Page</strong> whose ID value is X. Next, create a new <strong>Zend_Search_Lucene_Document</strong> object, which is the most generic search document class. From there, you add &#8220;fields&#8221; to the document. Each field can be of a specific type, can be given a unique name, and be assigned a value.</p><blockquote><p>Tip: The <strong>Zend_Search_Lucene_Document_Html</strong> class is just a Zend_Search_Lucene_Document with pre-defined fields.</p></blockquote><p>The field types are all identified using <strong>Zend_Search_Lucene_Field::<em>Something</em></strong>. They&#8217;re all listed in the Zend documentation and you&#8217;ll want to look those up before you get too involved with implementing this yourself. There are several important properties of the different field types. In particular, you want to pay attention to whether the field will be indexed or not and stored or not. For example, there&#8217;s no need to index the URL, because it won&#8217;t be meaningful as a search item, but it must be stored (to appear in the search results), so that field is of type <strong>UnIndexed</strong>. Conversely, the page&#8217;s title needs to be both stored and indexed, so it&#8217;s of type <strong>Text</strong>. The page&#8217;s content must be indexed but shouldn&#8217;t be stored (because that&#8217;d be inefficient), so it uses a field type of <strong>Unstored</strong>. Finally, I want to store but not index a preview of the document, so I use the <strong>UnIndexed</strong> type there again.</p><p>Again, this is where you&#8217;re actually creating the custom index: you decide what should be indexed and stored, in other words, what&#8217;s important. You might also store, for example, the modification date of the content (but not index the date). 
You&#8217;ll also see in that code that I&#8217;m running the content through two user-defined functions: <strong>clean_content()</strong> and <strong>preview_content()</strong>. Here&#8217;s how those are defined:</p><pre class="brush: php; title: ; notranslate">// Function for returning a preview of the content:
// The preview is the first $limit characters (400 by default), followed by '...'.
private function preview_content($data, $limit = 400) {
    return substr($data, 0, $limit) . '...';
} // End of preview_content() function.
// Function for stripping junk out of content:
private function clean_content($data) {
    return strip_tags($data);
}</pre><p>The clean_content() function strips out all the tags. This allows me to index just the text itself, without any HTML. Although defining a function that just invokes a single PHP function is generally a no-no, I went ahead and did that here in case I later decide to change how content is &#8220;cleaned&#8221;. The second function returns a subsection of the content, to be used as a preview. The preview will appear in the search results and is, by default, the first 400 characters of the content followed by an ellipsis.<br
/> Once a new document has been created, the last line (in the loop above) adds the document to the index.</p><p>After all of the content has been indexed (and you&#8217;ve completed the <strong>foreach</strong> loop), you can optimize and commit the index:</p><pre class="brush: php; title: ; notranslate">$index-&gt;optimize();
$index-&gt;commit();</pre><p>And that&#8217;s it! A custom index has now been created based upon data stored in the database. Is it a perfect replication of Google? No. But it works well as a custom index for a Web site. Once you understand how to create your own index, you can expand it, of course. In the site I worked on, there was a <strong>contacts</strong> table that stored lists of people (displayed on a single page in the site). To index that content, I retrieved all the contacts, and looped through them. For the title value, I used each person&#8217;s full name. For the content to be indexed, I concatenated several pieces of data about each contact: their name, title, organization, personal description, and so forth.</p><p>In this particular project, I also indexed PDFs stored on the server, a topic I&#8217;ll write up in a separate post.</p><p>I hope this was helpful. Thanks for reading and, as always, let me know if you have any questions or comments!</p> ]]></content:encoded> <wfw:commentRss>http://www.larryullman.com/2010/10/27/creating-a-search-index-with-zend_search_lucene/feed/</wfw:commentRss> <slash:comments>8</slash:comments> </item> <item><title>Integrating Zend_Lucene with Yii</title><link>http://www.larryullman.com/2009/12/05/integrating-zend_lucene-with-yii/</link> <comments>http://www.larryullman.com/2009/12/05/integrating-zend_lucene-with-yii/#comments</comments> <pubDate>Sat, 05 Dec 2009 01:47:16 +0000</pubDate> <dc:creator>Larry</dc:creator> <category><![CDATA[PHP]]></category> <category><![CDATA[Web Development]]></category> <category><![CDATA[framework]]></category> <category><![CDATA[search]]></category> <category><![CDATA[yii]]></category> <category><![CDATA[zend]]></category><guid
isPermaLink="false">http://www.larryullman.com/?p=790</guid> <description><![CDATA[I&#8217;m just not a big fan of using the Zend Framework as my Web development tool, but one of the framework&#8217;s nicest features is that you can use only the parts of it you need. I am, however, a big fan of the Yii framework and one of its many plusses is that you can [...]]]></description> <content:encoded><![CDATA[<p>I&#8217;m just not a big fan of using the Zend Framework as my Web development tool, but one of the framework&#8217;s nicest features is that you can use only the parts of it you need. I am, however, a big fan of the Yii framework and one of its many plusses is that you can easily integrate other frameworks and tools into it. Like, for example, the Zend Framework. Yii does not have its own search engine functionality, and Apache&#8217;s Lucene is arguably the gold standard (although clearly not the only choice), so tapping into Zend&#8217;s Lucene module for a Yii-driven site makes a lot of sense. In this post, I&#8217;ll walk you through the steps for integrating  Zend_Lucene into Yii. This post does assume familiarity with PHP, MVC, and Yii.<span
id="more-790"></span>To start, let&#8217;s create a spot in the Yii application for the Zend Framework. Create a new directory called <strong>vendors</strong> within the Yii <strong>protected</strong> folder. This isn&#8217;t required, but as the Zend Framework is a different beast than all the Yii code, I think it&#8217;s best to separate it out. Within <strong>vendors</strong>, create a directory called <strong>Zend</strong> (or ZendFramework, if you&#8217;d rather).</p><p>Next, <a
href="http://framework.zend.com/download/latest">download the Zend Framework</a>. You&#8217;ll want to download the latest full package, even though you&#8217;ll only use a bit of it. After the download has completed, expand the ZIP or TAR.GZ file (whichever format you choose to download the framework in). The result will be a folder named <strong>ZendFramework-<em>x.y.z</em></strong> (where <em>x.y.z</em> represents the full version number). Within that folder, go into <strong>library/Zend</strong> and copy <strong>Exception.php</strong> to <strong>protected/vendors/Zend</strong>. This is the file that the Zend Framework uses to report problems, so you&#8217;ll want to include it while developing and debugging Zend_Lucene with Yii. Also copy the <strong>Search</strong> folder to <strong>protected/vendors/Zend</strong>. You&#8217;ll end up with <strong>protected/vendors/Zend</strong> containing the <strong>Exception.php</strong> file and the <strong>Search</strong> folder.</p><p>In terms of the MVC architecture, the Zend Framework provides the Model to be used by this search process, but the Controller and View will still be done using Yii. First, let&#8217;s write a new Controller for searching:</p><pre>class SearchController extends CController
{
    private $_indexFile = '../runtime/search';
    public function actionIndex() {}
    public function actionCreate() {}
    public function actionSearch() {}
    public function actionUpdate() {}
}</pre><p>As with all Yii Controllers, this one extends the base <span
style="text-decoration: underline;">CController</span> class. Within this Controller the various methods are defined, corresponding to the actions that&#8217;ll be taken in the search process. The <em>index</em> action is the default and is for accessing the search page without performing an actual search (e.g., clicking on a link to go to the search page). The <em>create</em> action will be used to generate the search database: the series of files that Lucene needs to perform its searches. The <em>search</em> action is for handling submission of the search form (i.e., it does the actual searching). Finally, the <em>update</em> action is for updating the Lucene database files when necessary (like when the site content changes). The class also has one private variable that stores the location on the server of Lucene database files. I chose to put them in a <strong>search</strong> folder found within <strong>runtime</strong> (<strong>protected/runtime/search</strong>). This class member is good to have as multiple methods will need this information but I create it as a private variable as it&#8217;s not necessary (nor should it be accessed) outside of the class. As a naming convention, some like to use underscores at the front of private class variables.</p><p>Within three of the methods (not <strong>actionIndex()</strong>), the Controller will use Zend_Lucene. In order to do so, this script needs access to the Zend files, so import the contents of the <strong>vendors</strong> directory at the top of this script, just before the class definition begins:</p><pre>Yii::import('application.vendors.*');</pre><p>Then, include the <strong>Lucene.php</strong> page, found within the Zend Framework <strong>Search</strong> folder:</p><pre>require_once('Zend/Search/Lucene.php');</pre><p>Now this Yii Controller can create objects of type <span
style="text-decoration: underline;">Zend_Search_Lucene</span>, which is defined in that file. The actions will use that object type to perform the searches. To start, the index action just renders the index View:</p><pre>public function actionIndex()
{
    $this-&gt;render('index');
}</pre><p>Presumably the index View file just shows the search form. The search form, by the way, should have an action attribute of <span
style="text-decoration: underline;">www.example.com/index.php/search/search</span>, so that it calls the search action of the search Controller. The form should use the GET method and contain a text input with the name <em>terms</em>, as the search action reads <em>terms</em> from <strong>$_GET</strong>.</p><p>The update action would be used by an administrator to update the search database. Perhaps it&#8217;d be called automatically after some content is generated or once per hour or day. It would destroy the existing search database and then invoke the <strong>actionCreate()</strong> method. Lucene can&#8217;t update an indexed document in place (a changed document has to be deleted and re-added), so the simplest approach is to destroy and recreate the database instead. It doesn&#8217;t much matter what View this action renders; that depends upon what you want the admin to see. Maybe the View would just show a message indicating that the database has been updated.</p><p>The create action is an important one, and is where real knowledge of Lucene comes into play. The shell of it would look like so:</p><pre>public function actionCreate() {
    $index = new Zend_Search_Lucene($this-&gt;_indexFile, true);
    // Add documents to the database.
    $index-&gt;commit();
    $this-&gt;render('create');
}</pre><p>First, a <span
style="text-decoration: underline;">Zend_Search_Lucene </span>object is created (again, this is where Yii is making use of a class defined outside of Yii; it&#8217;s a sweet thing). The first argument provided when creating the object is the location of the database files. This is represented by the Controller variable, accessible in <strong>$this-&gt;_indexFile</strong>. The second argument indicates that a fresh database should be created. Next up, you add content to the database. This is complicated and well beyond the scope of what I&#8217;m writing here. I&#8217;ll try to discuss this, in brief, in a separate post, but I&#8217;d recommend you read as much as you can online first. In a very minimalistic way, you could add a single HTML page to the search database by doing this:</p><pre>$url = 'http://www.example.com/index.php/page/show/id/1';
$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($url);
$index-&gt;addDocument($doc);</pre><p>Finally, the database has to be saved by invoking the <strong>commit()</strong> method. And then some View is rendered. As this action would likely only be called by an administrator or cron, it doesn&#8217;t matter much what the View contains.</p><p>Lastly, there&#8217;s the search action. This action would check for search terms, run the search against the database, then send the results on to a View:</p><pre>public function actionSearch() {
    if (isset($_GET['terms'])) {
        $index = new Zend_Search_Lucene($this-&gt;_indexFile);
        $results = $index-&gt;find($_GET['terms']);
        $this-&gt;render('search', array('results' =&gt; $results));
    } else {
        $this-&gt;render('index');
    }
}</pre><p>First the method checks for the presence of search terms in the URL. Then it creates a <span
style="text-decoration: underline;">Zend_Search_Lucene</span> object, which is necessary for both creating and using the search database. This time only the location of the search database is passed when creating the object. The object&#8217;s <strong>find()</strong> method is invoked for performing the search (it can be that simple!). Then the search View is rendered, passing it the results. If no search terms were passed to this page, the index View is rendered instead. As for the search results View, a basic version to get you started might look like this:</p><pre>&lt;h2&gt;Search Results for "&lt;?php echo CHtml::encode($_GET['terms']); ?&gt;"&lt;/h2&gt;
&lt;?php if ($results): ?&gt;
    &lt;?php foreach($results as $result): ?&gt;
        &lt;p&gt;&lt;?php echo CHtml::encode($result-&gt;title); ?&gt;&lt;/p&gt;
    &lt;?php endforeach; ?&gt;
&lt;?php else: ?&gt;
    &lt;p class="error"&gt;No results matched your search terms.&lt;/p&gt;
&lt;?php endif; ?&gt;</pre><p>That&#8217;s largely the logic and structure of a search results View. It displays the provided search terms and checks for results. If there were some, each result title is printed. In a real application, you&#8217;d likely link the title to a URL or whatever but I don&#8217;t want to get too messy here. If you do <strong>print_r($result)</strong>, you&#8217;ll see a bunch of information there that you can use.</p><p>So those are the steps you need to take to get started using Zend_Lucene within your Yii application. These steps provide functionality; mastering Lucene is how you make this more professional. I&#8217;ll try to write more about defining a Lucene search database in subsequent posts towards that end. If you have any comments, questions, or requests, let me know.</p><p>Thanks,</p><p>Larry</p> ]]></content:encoded> <wfw:commentRss>http://www.larryullman.com/2009/12/05/integrating-zend_lucene-with-yii/feed/</wfw:commentRss> <slash:comments>17</slash:comments> </item> </channel> </rss>
