Re:visit Indexing PDF Documents with Zend Search Lucene

June 16th, 2008

I have decided to revive the Indexing PDF Documents with Zend Search Lucene article and see what else I can come up with. Maybe there is a better way to do it, and/or I can build a little application from it. I am open to suggestions and comments. More to follow in the coming weeks.


2 Comments to “Re:visit Indexing PDF Documents with Zend Search Lucene”


  1. Doug said:

    In a sub-project of Lucene there is an application called Nutch… Take a look at /nutch/plugins/parse-pdf/plugin.xml.

    It takes the contentType of application/pdf and, if the default site.xml has this plugin enabled, it will automatically create a Lucene index of the content of the PDFs during the crawl. No need to dump the text of the PDFs to a txt file and then index those files… Hope this helps… The deeper into this I get, the more fascinating it becomes.


  2. farrelley said:

    Yeah… I also found a site called http://www.pdfbox.org/
