Re:visit Indexing PDF Documents with Zend Search Lucene

I have decided to revive the Indexing PDF Documents with Zend Search Lucene article and see what else I can come up with. Maybe there is a better way to do it, or maybe I can build a small application from it. I am open to suggestions and comments. More to follow in the coming weeks.

2 thoughts on “Re:visit Indexing PDF Documents with Zend Search Lucene”

  1. Lucene has a sub-project called Nutch… Take a look at /nutch/plugins/parse-pdf/plugin.xml.

    It handles the contentType application/pdf, and if the default site.xml has this plugin enabled, it will automatically create a Lucene index of the content of the PDFs during the crawl. There is no need to dump the text of the PDFs to a txt file and then index those files… Hope this helps… The deeper I get into this, the more fascinating it becomes.
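
The idea in the comment above (pick a parser by content type, then feed the extracted text straight into the index with no intermediate txt file) can be sketched roughly as follows. This is only an illustration in Python, not Nutch's or Zend Search Lucene's actual API: `PARSERS`, `register_parser`, `parse_pdf`, and `index_document` are hypothetical names, the "index" is a plain in-memory dictionary, and the PDF parser is a toy stand-in that only finds literal `(text) Tj` strings in uncompressed content streams. A real setup would call a proper PDF library at that point.

```python
import re
from collections import defaultdict

# Hypothetical registry: content type -> function that turns raw bytes into text.
PARSERS = {}


def register_parser(content_type):
    """Register a text-extraction function for one content type."""
    def decorator(func):
        PARSERS[content_type] = func
        return func
    return decorator


@register_parser("text/plain")
def parse_plain(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")


@register_parser("application/pdf")
def parse_pdf(raw: bytes) -> str:
    # Toy stand-in (assumption): only handles literal strings shown with the
    # Tj operator in uncompressed streams. Use a real PDF library in practice.
    data = raw.decode("latin-1")
    return " ".join(re.findall(r"\(([^)]*)\)\s*Tj", data))


def index_document(index, doc_id, content_type, raw):
    """Parse raw bytes by content type and add the tokens to the index.

    Returns False when no parser is registered for the content type.
    """
    parser = PARSERS.get(content_type)
    if parser is None:
        return False
    text = parser(raw)
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        index[token].add(doc_id)
    return True


if __name__ == "__main__":
    index = defaultdict(set)
    index_document(index, "a.pdf", "application/pdf", b"BT (Hello Lucene) Tj ET")
    index_document(index, "b.txt", "text/plain", b"Lucene and Nutch")
    print(sorted(index["lucene"]))
```

The extracted text never touches disk here: the parser's output goes directly to the tokenizer, which is the step the comment says the plugin saves you.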
