Indexing PDF Documents with Zend_Search_Lucene

January 20th, 2008

Part I

Like many others, I have been looking for a way to index and search PDF files. Once Zend released its Framework, which includes Zend_Search_Lucene, a port of Java Lucene to PHP, I decided to jump on board and find a way to do it. So… let’s get started.

The Setup…

We will use XPDF to parse the PDF files and Zend_Search_Lucene for indexing and searching. First we need to read the PDF files and get relevant information such as the title, author, modification date, file size, and number of pages. To do all this, download XPDF and copy the pdftotext and pdfinfo binaries into a directory on your web server. pdfinfo gets us the metadata of the PDF, and pdftotext converts the PDF to a text file. Converting the PDF to text allows us to store its contents in the index.

Next we have to download the Zend Framework and copy it over to the web server. You could copy over only the Zend_Search folder and the files it depends on, but for now we will keep it easy and copy everything to the server.

The Code…

Let’s start by parsing a file and storing the data in the index. First we need to make sure that our include path points to the working directory so that we can load the Zend Search Lucene classes.

/** Zend_Search_Lucene */
ini_set('include_path', '/path/to/working/directory/');
require_once 'Zend/Search/Lucene.php';
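
The ini_set call above replaces the whole include path, which can hide libraries PHP was already configured to find. Here is a minimal sketch of adding a directory instead of overwriting the path. The helper name add_include_path and the directory are placeholders, not part of the original code.

```php
<?php
// Hypothetical helper: prepend a directory to include_path without
// discarding the paths PHP is already configured with.
function add_include_path($dir) {
    $paths = explode(PATH_SEPARATOR, get_include_path());
    if (!in_array($dir, $paths, true)) {
        array_unshift($paths, $dir);
        set_include_path(implode(PATH_SEPARATOR, $paths));
    }
    return get_include_path();
}

// Placeholder path; point it at the directory that contains the Zend folder.
add_include_path('/path/to/working/directory');
```

Calling the helper twice with the same directory leaves the path unchanged, so it is safe to call from several scripts.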

Next we create an index or open an existing one. Here we will call the index pdf_index. It is stored as a directory with the index files inside it. Calling create builds everything for you, so there is no need to create the pdf_index directory manually.

/** If the index exists then open it. Otherwise create the index. */
if (is_dir("/path/to/index/location/pdf_index/")) {
    // Open the existing index
    $index = Zend_Search_Lucene::open('/path/to/index/location/pdf_index/');
}
else {
    // Create a new index
    $index = Zend_Search_Lucene::create('/path/to/index/location/pdf_index/');
}

Once we have the index, we instantiate the document object that will be used to add data to it.

$doc = new Zend_Search_Lucene_Document();

Next is the most important part of the parsing. We will use XPDF’s pdfinfo to get metadata about the PDF file. This is a critical part of indexing a PDF, since it gives us the data we need in order to parse through the actual PDF file, most importantly the page count. We will store the output in an array and parse through it to get the information we need. Not all the metadata will be used, but you can certainly use the additional data if you want.

// Name of the PDF document without the extension
$pdf_filename = "KDS50A3000";
// Get the PDF information; pdfinfo prints one "Key: value" pair per line
$output = shell_exec("pdfinfo ".$pdf_filename.".pdf");
$data = explode("\n", $output); // puts it into an array
// Defaults, in case pdfinfo does not report one of the keys
$pagestr = $authorstr = $titlestr = $moddatestr = $sizestr = "";
// Parse through the array and store the lines we need in variables.
// Matching at the start of the line avoids hits inside another key's value.
for ($c = 0; $c < count($data); $c++) {
    // Number of pages
    if (stripos($data[$c], "Pages:") === 0) {
        $pagestr = $data[$c];
    }
    // Author
    if (stripos($data[$c], "Author:") === 0) {
        $authorstr = $data[$c];
    }
    // Title
    if (stripos($data[$c], "Title:") === 0) {
        $titlestr = $data[$c];
    }
    // Modification date
    if (stripos($data[$c], "ModDate:") === 0) {
        $moddatestr = $data[$c];
    }
    // File size
    if (stripos($data[$c], "File size:") === 0) {
        $sizestr = $data[$c];
    }
}
// Remove the labels from the metadata
$pages = (int) trim(substr($pagestr, 6));
$author = trim(substr($authorstr, 7));
$title = trim(substr($titlestr, 6));
$date = trim(substr($moddatestr, 8));
// Strip the "File size:" label and the trailing "bytes" before formatting
$size = get_size((int) trim(substr(trim(substr($sizestr, 10)), 0, -5)));
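
The fixed substr offsets above depend on the exact length of each label. As an alternative, here is a rough sketch that splits every line on its first colon and collects the pairs in an associative array. The function name parse_pdfinfo and the sample text are made up for illustration; real pdfinfo output has more keys.

```php
<?php
// Hypothetical helper: turn pdfinfo's "Key:   value" lines into an array.
function parse_pdfinfo($output) {
    $info = array();
    foreach (explode("\n", $output) as $line) {
        // Split on the first colon only, so values such as dates keep theirs.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $info[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $info;
}

// Made-up sample in the shape pdfinfo prints.
$sample = "Title:          Example Manual\n"
        . "Author:         Jane Doe\n"
        . "ModDate:        Sun Jan 20 10:15:00 2008\n"
        . "Pages:          2\n"
        . "File size:      204800 bytes\n";

$info  = parse_pdfinfo($sample);
$pages = isset($info['Pages']) ? (int) $info['Pages'] : 0;
echo $info['Title'] . " has " . $pages . " pages\n";
```

With this approach a missing key simply does not appear in the array, so each lookup can be guarded with isset as shown for the page count.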

The file size is passed through another function, get_size, which puts it in a human-readable form.

function get_size($size) {
    $units = array('B','KB','MB','GB','TB');
    $i = 0;
    // Divide by 1024 until the value fits the unit, stopping at the largest unit
    while ($size >= 1024 && $i < count($units) - 1) {
        $size = $size / 1024;
        $i++;
    }
    return round($size, 2)." ".$units[$i];
}
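
To make the conversion concrete, here is a small worked sketch. It pulls the byte count out of a "File size" line with a regular expression instead of nested substr calls, then formats it with the same 1024-based units. Both function names (parse_file_size, format_bytes) and the sample line are assumptions for illustration.

```php
<?php
// Hypothetical helper: pull the byte count out of a pdfinfo "File size" line.
function parse_file_size($line) {
    if (preg_match('/^File size:\s*(\d+)\s*bytes/i', $line, $m)) {
        return (int) $m[1];
    }
    return null; // the line did not look like "File size: N bytes"
}

// Hypothetical formatter using 1024-based units, capped at TB.
function format_bytes($size) {
    $units = array('B', 'KB', 'MB', 'GB', 'TB');
    $i = 0;
    while ($size >= 1024 && $i < count($units) - 1) {
        $size /= 1024;
        $i++;
    }
    return round($size, 2) . " " . $units[$i];
}

$bytes = parse_file_size("File size:      1536000 bytes");
echo format_bytes($bytes) . "\n"; // 1536000 / 1024 / 1024 rounds to 1.46 MB
```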

Now that we have all the data we need to parse through the PDF file, here is how we will do it. Since we know how many pages there are in the PDF, we will create a for loop and step through each page. So if we have a two-page PDF file, we will convert page one and store it in the index, then convert page two and store it in the index. Each page is stored as a separate document in the index. This lets a search result point to the specific page of the PDF that contains the matching terms.

Below is the code that does just that.

// Parse the pages into the index
for ($i = 1; $i <= $pages; $i++) {
    // Here we loop through all the pages in the PDF
    $filename = $pdf_filename.$i.".txt";
    exec("pdftotext -f ".$i." -l ".$i." ".$pdf_filename.".pdf ".$filename);
    // Start a fresh document for this page; it is added to the index at the end of the loop
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Keyword('id', $i)); // Stores the ID
    // Stores the file name of the PDF
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $pdf_filename.".pdf"));
    // Read the text file (one page of the PDF) into a string variable,
    // then store that string in the document object for the index
    $docContent = file_get_contents($filename);
    $doc->addField(Zend_Search_Lucene_Field::Text('contents', $docContent));
    // Stores the page number so that we can open this page up
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('page', $i));
    // Add all the data stored in the document object to the index
    $index->addDocument($doc);
    // Get rid of the text file that the page data was stored in
    unlink($filename);
    echo "Page ".$i." Complete\n";
}
// Optimize the index so that things run better.
// This is done after all the pages are inserted into the index.
$index->optimize();
// Get some index data
$indexSize = $index->count();
$documents = $index->numDocs();
echo "Index Size: ".$indexSize."\n";
echo "Document Count: ".$documents."\n";
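
The pdftotext command in the loop above is built by plain string concatenation, so a file name containing a space or a shell metacharacter would break it. Here is a minimal sketch of building the same command with quoting. The helper name pdftotext_command and the file names are hypothetical.

```php
<?php
// Hypothetical helper: build the pdftotext command for a single page.
// -f and -l are the first and last page to convert, as in the loop above.
function pdftotext_command($pdf, $page, $txt) {
    $page = (int) $page; // force an integer so nothing else reaches the shell
    return "pdftotext -f " . $page . " -l " . $page . " "
         . escapeshellarg($pdf) . " " . escapeshellarg($txt);
}

// A file name with a space would break the unquoted command.
$cmd = pdftotext_command("Owner Manual.pdf", 3, "Owner Manual3.txt");
echo $cmd . "\n";
```

The resulting string can be passed to exec in place of the concatenated one. The exact quoting characters depend on the platform, since escapeshellarg quotes differently on Windows and Unix.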

That’s pretty much it for putting the PDF data into the index. Now all that’s left is searching the index, which will be addressed in another post.

Let’s review a little…

Our goal was to index PDF files for searching. Using XPDF, Zend Search Lucene from the Zend Framework, and a PDF file, we were able to get metadata from the PDF file and parse the file page by page while inserting data into the index. What I didn’t get into was how the index works, which Zend_Search_Lucene_Field types work best, and what is indexed and what is not. That is all documented nicely at the Zend Framework site.

Also, this code is simple. It works with one PDF file, doesn’t check whether the PDF is already indexed, and so forth. However, it can easily be modified to do things dynamically. This is just a prototype to show that indexing and searching PDFs can be done. Part II will be posted soon and will show you how to search the index. I also suggest looking at the following sites that may give you more insight on implementing Zend Search Lucene.

For now, enjoy! Oh yeah…here is the Zip File of the data files.


21 Comments to “Indexing PDF Documents with Zend_Search_Lucene”


  6. amin2u said:

    I have a lot of PDF files parked in a directory that all intranet users can access. They can open the files, but there is no engine to index them. Do you have a solution that lets users search for relevant content in the PDF files just by using a keyword?

    Right now I intend to have a powerful web-based engine along the lines of Google Desktop Search.


  7. amin2u said:

    How do I install XPDF on Windows?


  8. amin2u said:

    Fatal error: Uncaught exception 'Zend_Search_Lucene_Exception' with message 'fopen(C:/wamp/www//pdf_index//segments) [function.fopen]: failed to open stream: No such file or directory' in C:\wamp\www\pdf_index\Zend\Search\Lucene\Storage\File\Filesystem.php:64
    Stack trace:
    #0 C:\wamp\www\pdf_index\Zend\Search\Lucene\Storage\Directory\Filesystem.php(338): Zend_Search_Lucene_Storage_File_Filesystem->__construct('C:/wamp/www//pd…')
    #1 C:\wamp\www\pdf_index\Zend\Search\Lucene.php(235): Zend_Search_Lucene_Storage_Directory_Filesystem->getFileObject('segments')
    #2 C:\wamp\www\pdf_index\Zend\Search\Lucene.php(182): Zend_Search_Lucene->__construct('C:/wamp/www//pd…', false)
    #3 C:\wamp\www\pdf_index\searchtxt.php(11): Zend_Search_Lucene::open('C:/wamp/www//pd…')
    #4 {main}
    thrown in C:\wamp\www\pdf_index\Zend\Search\Lucene\Storage\File\Filesystem.php on line 64

    I got the error above. Where do I fix it?


  9. farrelley said:

    amin2u: I would index your files with Zend and tag them with keywords. You can save each keyword in a document field for that PDF in the index.

    As for XPDF on Windows, go to http://www.foolabs.com/xpdf/download.html and download the win32 version. Your PHP scripts will have to call it with the system() or exec() functions, because XPDF is run as a DOS command.

    As for your error, it looks like it comes from the fopen() call. Make sure your path is correct.

    Good luck!


  10. amin2u said:

    Have you tested searchtxt.php on Windows? Is it working? I still have a problem with fopen.


  11. farrelley said:

    Yes, I have used it on Windows. I will have to look at the code again and get back to you.


  12. amin2u said:

    I just have problems with the path… and today it still doesn’t work. I suppose pdftotext will have an .exe extension, right?


  13. farrelley said:

    That’s correct. You need to download the .exe files from the site I posted above. If you downloaded the source that I included with the project, you will only get the Linux versions of pdftotext.


  14. amin2u said:

    Do I have to specify the .exe in the exec statement?


  15. farrelley said:

    @amin2u: Yes, you have to put the .exe in the exec statement. You can’t just copy the same code that I put out, because that was optimized for a Unix install. You need to get the correct .exe files for pdftotext and pdfinfo and use the correct paths.


  16. esca said:

    hi…

    I already read this article and ran it with my own PDF files, but what I want to know is this: Zend already has a Zend_Pdf_FileParser subclass. Does it do the same thing as your article?

    Can you explain?

    Thanks in advance.


  17. Slavi said:

    Hi,

    Thanks for sharing your thoughts on Zend_Search_Lucene.
    I want to point to this:


    # $doc->addField(Zend_Search_Lucene_Field::Keyword('id', $i)); //Stores the ID

    src: http://framework.zend.com/manual/en/zend.search.lucene.best-practice.html
    —— quote —
    Nevertheless it’s a good idea not to use ‘id’ and ‘score’ names to avoid ambiguity in QueryHit properties names.

    The Zend_Search_Lucene_Search_QueryHit id and score properties always refer to internal Lucene document id and hit score. If the indexed document has the same stored fields, you have to use the getDocument() method to access them.
    —— quote —


  18. ZEMZEMI said:

    Hi,
    Thank you for your source code, it was very helpful for me. But I have a question: does XPDF retrieve the Keywords metadata? I don’t think so, because I tested your code with the XPDF library and it only shows authors, title, date…
    I want to know if I can retrieve keywords with Zend_Search_Lucene.

    thanks


  19. farrelley said:

    @ZEMZEMI – I’m not sure if XPDF gets that metadata or not. If it does, you can just store it in the index and retrieve it when you loop through the hits.


  20. ZEMZEMI said:

    @farrelley – thank you for your help. I want to know if I can retrieve keywords and the abstract with XPDF.


  21. hind14 said:

    Can someone tell me where we should put the code (in a controller, a view, …)? And when you talk about ‘path’, is that the XPDF installation path or something else? I need your suggestions, please.
