Re:visit Indexing PDF Documents with Zend Search Lucene
June 16th, 2008
I have decided to revive the Indexing PDF Documents with Zend Search Lucene article and see what else I can come up with. Maybe there is a better way to do it, or maybe I can build a little application from it. I am open to suggestions and comments. More to follow in the coming weeks.
Doug said:
There is a Lucene sub-project called Nutch… Take a look at /nutch/plugins/parse-pdf/plugin.xml.
It handles the application/pdf content type, and if the default site.xml has this plugin enabled, it will automatically create a Lucene index of the PDFs' content during the crawl. No need to dump the text of the PDFs to a txt file and then index those files… Hope this helps… The deeper I get into this, the more fascinating it becomes.
farrelley said:
Yeah… I also found a site called http://www.pdfbox.org/