Part II – Indexing PDF Documents with Zend_Search_Lucene
January 26th, 2008
Part II
Now that we know how to index PDF documents, we need to be able to search that index and return relevant results. When the results come back, we also want to open the matching PDF to the correct page.
The Code
First, as in the last example, we have to tell the server which directory we are working in and where to find Zend_Search_Lucene.
<?php
/** Zend_Search_Lucene */
ini_set('include_path','/PATH/TO/DIR/');
require_once 'Zend/Search/Lucene.php';
?>
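One caveat: `ini_set('include_path', ...)` replaces whatever include path was already configured. If you would rather keep the existing entries, a minimal sketch of an alternative is to prepend the library directory instead. `/PATH/TO/DIR/` is the same placeholder as above, and the helper name `prepend_include_path` is made up for illustration; it is not part of Zend Framework.

```php
<?php
/**
 * Prepend a directory to PHP's include path without discarding the
 * entries that are already configured.
 * (Hypothetical helper for illustration only.)
 */
function prepend_include_path($dir)
{
    $current = get_include_path();
    // PATH_SEPARATOR is ":" on Unix-like systems and ";" on Windows.
    set_include_path($dir . PATH_SEPARATOR . $current);
    return get_include_path();
}

// Placeholder path, as in the example above.
prepend_include_path('/PATH/TO/DIR/');
```

After this call, `require_once 'Zend/Search/Lucene.php';` is looked up in `/PATH/TO/DIR/` first and then in the directories that were already on the include path.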
Next is the form where the user enters their search query. This is just a basic form that resubmits to the same page.
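<?php
// Escape the submitted value before echoing it back into the form.
$search = isset($_REQUEST['search']) ? htmlspecialchars($_REQUEST['search'], ENT_QUOTES) : '';
echo "<form>";
echo "<input type=\"text\" name=\"search\" value=\"".$search."\" /> <input type=\"submit\" />";
echo "</form>";
?>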
Next is the meat of the page. We check whether a search query was submitted; if it is missing or empty, we skip this section entirely. Otherwise we capture the query the user entered and strip out the word “to”. There may be a better way to do this, but this is what I found easiest. The reason is that Lucene interprets the word “to” as a range operator and doesn’t return the results you want. Next we open the index and search it for the terms the user entered. The matches come back as an array of hits, and we loop through them with a foreach statement. For each hit we output the score (how well it matched the search terms), the URL, the page number, and the contents (only the first 500 characters). Opening the PDF to the correct page is as easy as appending #page=X to the URL.
<?php
if (isset($_REQUEST['search']) && trim($_REQUEST['search']) != "") {
    $query = trim($_REQUEST['search']);
    // Strip the standalone word "to"; \b keeps words like "photo" intact
    $query = preg_replace('/\bto\b/i', '', $query);

    // Open existing index
    $index = Zend_Search_Lucene::open('/PATH/TO/INDEX/');
    $hits = $index->find($query);

    foreach ($hits as $hit) {
        // Escape indexed values before writing them into the page
        echo "<a href=\"/RELATIVE/PATH/TO/PDF/".htmlspecialchars($hit->url, ENT_QUOTES)."#page=".(int)$hit->page."\">PDF</a><br />";
        echo number_format((($hit->score)*100),3)."%<br />";
        echo htmlspecialchars(substr($hit->contents,0,500))."<br />";
        echo "Data on Page ".(int)$hit->page."<br />";
        echo "<br />";
    }
}
?>
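Two details in this step are easy to get subtly wrong, so here is a minimal sketch of each, not a drop-in replacement. It assumes you only want to drop “to” when it stands alone as a word (a plain substring replace would also turn “photo” into “pho”), and that the URL and page number come from the index as in the loop above. The function names `strip_range_keyword` and `pdf_page_link` and the file name `manual.pdf` are made up for illustration.

```php
<?php
/**
 * Remove the standalone word "to" (any case) from a query string.
 * \b anchors the match to word boundaries, so "photo" and "tomato"
 * are left untouched. (Hypothetical helper.)
 */
function strip_range_keyword($query)
{
    $query = preg_replace('/\bto\b/i', ' ', $query);
    // Collapse the whitespace left behind by the removal.
    return trim(preg_replace('/\s+/', ' ', $query));
}

/**
 * Build an escaped link that opens a PDF at a given page using the
 * "#page=N" URL fragment. $basePath is a placeholder for wherever the
 * PDFs are served from. (Hypothetical helper.)
 */
function pdf_page_link($basePath, $url, $page, $label = 'PDF')
{
    $href = rtrim($basePath, '/') . '/' . ltrim($url, '/') . '#page=' . (int) $page;
    return '<a href="' . htmlspecialchars($href, ENT_QUOTES) . '">'
         . htmlspecialchars($label, ENT_QUOTES) . '</a>';
}

// Example values only; in the page above they would come from
// $_REQUEST['search'], $hit->url and $hit->page.
echo strip_range_keyword('how to index photo files'), "\n";
echo pdf_page_link('/RELATIVE/PATH/TO/PDF/', 'manual.pdf', 12), "\n";
```

Keeping these as small functions makes it easy to try them on a few sample queries before wiring them into the search page.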
That’s it! Just as with indexing, searching has many options. Be sure to check out the Zend_Search_Lucene documentation for them. Need help? Leave a comment or check out Part I.
Daniel said:
Thanks for that great stuff!
But after getting xpdf and Search_Lucene to run, I ran into the following problem:
After indexing a PDF and searching it with your code, it only searches the content of the first page of the PDF!
Maybe the problem is in the indexing or in the search, I don’t know!
Can anyone help me?
greets daniel
farrelley said:
Hey Daniel.. It sounds like the PDF isn’t being parsed out into multiple pages. I would check that when you put the pages into the index, it’s not inserting just the first page. You can also send me the PDF and I can take a look at it. I am pretty sure the problem is in the parsing of the PDF into pages.
Jean said:
Thanks a lot for your great articles.
How about highlighting the search terms in pdf? Any suggestions?