Part II – Indexing PDF Documents with Zend_Search_Lucene

Part II

Now that we know how to index PDF documents, we need to be able to search that index and return relevant results. When the results are returned, we also want to open the specified PDF to the correct page.
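As a rough illustration of the "open to the correct page" step, here is a minimal sketch. It assumes the page number is stored with each search hit and that the link uses the `#page=N` PDF open parameter, which Adobe Reader and most browser PDF viewers honor. The helper name `pdfPageUrl()` and the example path are made up for this sketch (PHP 7 or later assumed), and this is not necessarily how the rest of the article does it.

```php
<?php
/**
 * Build a link that asks the PDF viewer to open at a given page.
 * pdfPageUrl() is a hypothetical helper, not part of Zend_Search_Lucene.
 */
function pdfPageUrl(string $pdfUrl, int $page): string
{
    // Page numbers start at 1, so clamp anything lower.
    $page = max(1, $page);
    // Drop any existing fragment so the URL does not end up with two "#" parts.
    $base = explode('#', $pdfUrl, 2)[0];
    return $base . '#page=' . $page;
}

// Example path is an assumption for illustration only.
echo pdfPageUrl('/manuals/setup-guide.pdf', 14), "\n";
```

Viewers that do not understand the fragment simply open the document at page one, so the link degrades gracefully.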

The Code

First, as in the last example, we have to tell the server what directory we are going to work in and where to find Zend_Search_Lucene.

<?php /** Zend_Search_Lucene */

ini_set('include_path','/PATH/TO/DIR/');

require_once 'Zend/Search/Lucene.php';

?>


McCain Blogette Rss-less

I have been following John McCain’s daughter’s blog, “The McCain Blogette,” but find it very hard to keep up with because there is no RSS feed. It is very frustrating to keep up with blogs and news when there are no RSS feeds. It kills me that I have to create a folder on my toolbar in Firefox and keep 100 sites in it because they don’t have a feed. It’s much easier just to open up Google Reader and start reading.

Justin has written a post, “My Favorite Web Sites Are The Ones I Visit The Least,” where he says:

Recently, I had the realization that my favorite Web sites were the ones I visited the least. The first thing I do when I find a Web site that I want to visit again is look for an RSS feed. If I find one, chances are I won’t come to the Web site again for a while. If the Web site doesn’t have an RSS feed, I’ll forget about it and the site will miss out on some traffic.

I agree with Justin on this. I hardly ever go to a site that has an RSS feed. I just hope that developers start to realize that RSS feeds are a good thing and not a bad thing. I hear a lot of site owners saying we had “x” amount of hits today but when are we going to start saying we have “x” amount of subscribers to our feed?

Indexing PDF’s – The Why?

After I published my small article on “Indexing PDF Documents with Zend_Search_Lucene,” I was surprised to find it on the Zend Developer Zone blog. I had no idea that it would get the attention that it did, and I thank everyone for checking it out. So now that you know how to index a PDF, you may be asking: why the heck would you do this?

Many companies, large and small, have support centers, whether internal or external help desks. In addition to the help desk, many companies publish PDF documents such as manuals, specs, service guides, and setup/connection guides. So instead of a help desk employee (or anyone) having to remember which manual covers what and which page everything is on, you can simply index these PDF files for easy searching. Think about it this way: say you have 50 products with 5 manuals each. That’s 250 manuals to keep track of (not counting how many pages each manual has). The easy way is to index the PDFs, add the necessary metadata for each manual, build a search form on a web page, and voilà. You have an easy way to search PDF files, finding information quickly for a customer or whoever, and saving loads of time otherwise spent searching page by page for the same information.

Many companies do this, and many companies boast that they are the best at doing it. So next time you are looking for a searchable PDF solution, remember that anyone can do this and it’s easy to do yourself.

Indexing PDF Documents with Zend_Search_Lucene

Part I

I, along with many others, have been trying to work out how to index and search PDF files. Once Zend released its Framework, which includes Zend_Search_Lucene, a port of Java Lucene to PHP, I decided to jump on board and find a way to index and search PDF files. So… let’s get started.

The Setup…

We will use XPDF to parse the PDF files and Zend_Search_Lucene for indexing and searching. First we need to read the PDF files and get relevant PDF information such as title, author, modification date, file size, and number of pages. To do all this, download XPDF and copy the pdftotext and pdfinfo programs into a directory on your web server. pdfinfo gives us the PDF’s metadata, and pdftotext converts the PDF to a text file. Converting the PDF to text is what allows us to store its contents in the index.
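To make the pdfinfo step concrete, here is a minimal sketch of turning its output into a PHP array. pdfinfo prints one "Key: value" pair per line (Title, Author, ModDate, Pages, File size, and so on). The function names `parsePdfInfo()` and `readPdfInfo()`, the `./pdfinfo` path, and the sample values are all assumptions made for this illustration (PHP 7 or later assumed), not code from the article.

```php
<?php
/**
 * Parse the "Key: value" lines printed by pdfinfo into an associative array.
 * parsePdfInfo() is a hypothetical helper name, not part of XPDF or Zend.
 */
function parsePdfInfo(string $output): array
{
    $info = [];
    foreach (preg_split('/\R/', $output) as $line) {
        // Split on the first colon only; values such as dates contain colons too.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $info[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $info;
}

/**
 * Run pdfinfo on a file and return its parsed metadata.
 * $pdfinfoBin is an assumed path to the binary copied from XPDF.
 */
function readPdfInfo(string $pdfFile, string $pdfinfoBin = './pdfinfo'): array
{
    $output = shell_exec(escapeshellarg($pdfinfoBin) . ' ' . escapeshellarg($pdfFile));
    return parsePdfInfo((string) $output);
}

// Sample text in the shape pdfinfo prints (values are made up for illustration).
$sample = "Title:          Setup Guide\nAuthor:         Jane Doe\nPages:          12\n";
$info = parsePdfInfo($sample);
echo $info['Title'], ' has ', (int) $info['Pages'], " pages\n";
```

Keeping the parsing separate from the shell call means the parser can be checked against sample text without XPDF installed, and `escapeshellarg()` guards against file names containing spaces or shell characters.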

Next we will have to download the Zend Framework and copy it over to the web server. You can just copy over the Zend_Search folder and the necessary functions but for now we will make it easy and copy everything to the server.

The Code…

Let’s start by parsing a file and storing the data in the index. First we need to make sure that our include path points to the working directory and that we can access the Zend_Search_Lucene functions.

/** Zend_Search_Lucene */
ini_set('include_path','/path/to/working/directory/');
require_once 'Zend/Search/Lucene.php';
