Part II - Indexing PDF Documents with Zend_Search_Lucene

January 26th, 2008

Part II

Now that we know how to index PDF documents, we need to be able to search that index and return relevant results. When the results are returned, we also want to open the specified PDF to the correct page.

The Code

First, as in the last example, we have to tell the server which directory we are going to work in and where to find Zend_Search_Lucene.


<?php
/** Zend_Search_Lucene */
ini_set('include_path', '/PATH/TO/DIR/');
require_once 'Zend/Search/Lucene.php';
?>
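One thing to watch: ini_set('include_path', ...) replaces the whole include path. If other code on the server relies on the existing entries, a minimal sketch like the following appends the directory instead. The $zendDir variable is just an illustrative name, and the path is the same placeholder used above.

```php
<?php
// Hypothetical variable name; the path is the placeholder from the example above.
$zendDir = '/PATH/TO/DIR/';

// Append to the current include path rather than overwriting it,
// so any directories that were already configured keep working.
set_include_path(get_include_path() . PATH_SEPARATOR . $zendDir);
```

After this, the require_once line works the same way as before.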

Next is the form where the user enters their search query. This is just a basic form that resubmits to the same page.


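<?php
// Re-display the previous query, escaped so it cannot break out of the attribute
$search = isset($_REQUEST['search']) ? $_REQUEST['search'] : '';
echo '<form>';
echo '<input type="text" name="search" value="' . htmlspecialchars($search, ENT_QUOTES) . '" />';
echo '<input type="submit" />';
echo '</form>';
?>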

Next is the meat of the page. First we check whether a search query was submitted; if it is empty we skip this section entirely. If there is one, we capture what the user entered and strip out the word “to”. There may be a better way to do this, but this is what I found easiest. The reason is that Lucene's query syntax uses “to” for range queries, so leaving it in doesn't return the results you want. Next we open the index and search it for the terms the user entered. The matches come back as an array of hits, which we loop through with a foreach statement. For each hit we output the score (how well it matched the search terms), the URL, the page number, and the contents (only the first 500 characters). Opening the PDF to the correct page is as easy as appending a URL fragment, #page=X, to the link.


<?php
if (isset($_REQUEST['search']) && trim($_REQUEST['search']) != "") {
    $query = trim($_REQUEST['search']);
    // Remove the standalone word "to" (whole word only, so "tomato" is left alone)
    $query = preg_replace('/\bto\b/i', '', $query);

    // Open existing index
    $index = Zend_Search_Lucene::open('/PATH/TO/INDEX/');
    $hits = $index->find($query);

    foreach ($hits as $hit) {
        // #page=N tells the PDF viewer which page to open
        echo "<a href=\"/RELATIVE/PATH/TO/PDF/" . $hit->url . "#page=" . $hit->page . "\">PDF</a><br />";
        echo number_format($hit->score * 100, 3) . "%<br />";
        echo htmlspecialchars(substr($hit->contents, 0, 500)) . "<br />";
        echo "Data on Page " . $hit->page . "<br />";
        echo "<br />";
    }
}
?>
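The two tricky bits here, removing “to” and building the #page=X link, can be tried on their own without an index. The sketch below is only an illustration: stripRangeKeyword() and pdfPageLink() are hypothetical helper names, not part of Zend_Search_Lucene, and the link helper assumes the stored URL is a bare file name with no subdirectories.

```php
<?php
// Hypothetical helpers for illustration; neither is part of Zend_Search_Lucene.

// Remove "to" only when it stands alone as a word, so "tomato" is untouched.
function stripRangeKeyword($query)
{
    $query = preg_replace('/\bto\b/i', ' ', $query);
    // Collapse the extra whitespace left behind by the removal.
    return trim(preg_replace('/\s+/', ' ', $query));
}

// Build a link that opens the PDF at a given page via the #page=N fragment.
// Assumes $file is a bare file name (a "/" in it would be percent-encoded).
function pdfPageLink($baseUrl, $file, $page)
{
    $href = rtrim($baseUrl, '/') . '/' . rawurlencode($file) . '#page=' . (int) $page;
    return '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">PDF</a>';
}

// prints: how index tomato PDFs
echo stripRangeKeyword('how to index tomato PDFs'), "\n";

// prints: <a href="/RELATIVE/PATH/TO/PDF/user%20guide.pdf#page=12">PDF</a>
echo pdfPageLink('/RELATIVE/PATH/TO/PDF/', 'user guide.pdf', 12), "\n";
```

Matching on word boundaries matters because a plain substring replace would also eat the “to” inside longer words and quietly change what the user searched for.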

That’s it! Just like before, searching has many options. Be sure to check out the Zend_Search_Lucene documentation for these search options. Need help? Leave a comment or check out Part I.

