<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Re:vist Indexing PDF Documents with Zend Search Lucene</title>
	<atom:link href="http://www.kapustabrothers.com/2008/06/16/revist-indexing-pdf-documents-with-zend-search-lucene/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.kapustabrothers.com/2008/06/16/revist-indexing-pdf-documents-with-zend-search-lucene/</link>
	<description></description>
	<lastBuildDate>Tue, 15 Nov 2011 15:00:31 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: farrelley</title>
		<link>http://www.kapustabrothers.com/2008/06/16/revist-indexing-pdf-documents-with-zend-search-lucene/comment-page-1/#comment-12357</link>
		<dc:creator>farrelley</dc:creator>
		<pubDate>Wed, 18 Jun 2008 00:06:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.kapustabrothers.com/2008/06/16/revist-indexing-pdf-documents-with-zend-search-lucene/#comment-12357</guid>
		<description>Yeah... I also found a site called http://www.pdfbox.org/</description>
		<content:encoded><![CDATA[<p>Yeah&#8230; I also found a site called <a href="http://www.pdfbox.org/" rel="nofollow">http://www.pdfbox.org/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Doug</title>
		<link>http://www.kapustabrothers.com/2008/06/16/revist-indexing-pdf-documents-with-zend-search-lucene/comment-page-1/#comment-12356</link>
		<dc:creator>Doug</dc:creator>
		<pubDate>Wed, 18 Jun 2008 00:03:30 +0000</pubDate>
		<guid isPermaLink="false">http://www.kapustabrothers.com/2008/06/16/revist-indexing-pdf-documents-with-zend-search-lucene/#comment-12356</guid>
		<description>There is a sub-project of Lucene called Nutch... Take a look at /nutch/plugins/parse-pdf/plugin.xml.

It handles the contentType application/pdf, and if the default site.xml has this plugin enabled, it will automatically create a Lucene index of the content of the PDFs during the crawl. No need to dump the text of the PDFs to a txt file and then index those files... Hope this helps. The deeper into this I get, the more fascinating it becomes.</description>
		<content:encoded><![CDATA[<p>There is a sub-project of Lucene called Nutch&#8230; Take a look at /nutch/plugins/parse-pdf/plugin.xml.</p>
<p>It handles the contentType application/pdf, and if the default site.xml has this plugin enabled, it will automatically create a Lucene index of the content of the PDFs during the crawl. No need to dump the text of the PDFs to a txt file and then index those files&#8230; Hope this helps. The deeper into this I get, the more fascinating it becomes.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

