<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Allen Day's Blog &#187; Nutch</title>
	<atom:link href="http://www.spicylogic.com/allenday/blog/category/computing/distributed-systems/hadoop/nutch/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.spicylogic.com/allenday/blog</link>
	<description>♥data♥</description>
	<lastBuildDate>Mon, 21 Jun 2010 23:28:18 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Using Nutch to download large binary media and image files</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/#comments</comments>
		<pubDate>Fri, 29 Aug 2008 07:29:34 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Nutch]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/?p=55</guid>
		<description><![CDATA[Here&#8217;s a recipe for using Nutch to crawl some site(s) and extract out the images.  I&#8217;m blogging this because I couldn&#8217;t find (no surprise here, sigh) any documentation or complete examples via mailing list archives for how to do this.
Step 1: modify Nutch URL filters
Okay, so first thing, modify $NUTCH_HOME/conf/crawl-urlfilter.txt .  Let&#8217;s assume you only [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a recipe for using Nutch to crawl some site(s) and extract out the images.  I&#8217;m blogging this because I couldn&#8217;t find (no surprise here, sigh) any documentation or complete examples via mailing list archives for how to do this.</p>
<h2>Step 1: modify Nutch URL filters</h2>
<p>Okay, so first thing, modify <code>$NUTCH_HOME/conf/crawl-urlfilter.txt</code>.  Let&#8217;s assume you only care about JPEG images; in that case, change this line:</p>

<div class="wp_syntax"><div class="code"><pre>-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$</pre></div></div>

<p>to this:</p>

<div class="wp_syntax"><div class="code"><pre>-\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$</pre></div></div>

<p>Also update the &#8220;MY.DOMAIN.NAME&#8221; section appropriately.</p>
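<p>To sanity-check the edited rule before crawling, here is a minimal Java sketch.  It assumes a <code>-</code> rule excludes a URL whenever the pattern is found anywhere in it; the <code>UrlFilterCheck</code> class and its method names are made up for illustration and are not part of Nutch.</p>

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks which URLs the suffix rule would exclude.
final class UrlFilterCheck {
  // The suffix rule as shipped (JPEG extensions are excluded).
  static final Pattern BEFORE = Pattern.compile(
      "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$");

  // The suffix rule after removing jpg|JPG|jpeg|JPEG.
  static final Pattern AFTER = Pattern.compile(
      "\\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$");

  // Assumption: a "-" rule excludes the URL when the pattern matches any part of it.
  static boolean excluded(Pattern rule, String url) {
    return rule.matcher(url).find();
  }

  public static void main(String[] args) {
    String jpg = "http://spicylogic.com/some-url.jpg";
    String png = "http://spicylogic.com/some-url.png";
    System.out.println("before, jpg excluded: " + excluded(BEFORE, jpg));
    System.out.println("after,  jpg excluded: " + excluded(AFTER, jpg));
    System.out.println("after,  png excluded: " + excluded(AFTER, png));
  }
}
```

<p>With the edited rule, JPEG URLs should no longer be excluded, while the other binary types still are.</p>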
<h2>Step 2: set up crawl configuration</h2>
<p>Edit <code>$NUTCH_HOME/conf/nutch-site.xml</code>.  You want to update/add properties for <code>http.content.limit</code> and <code>file.content.limit</code> so that your big files don&#8217;t get truncated.  Look at <code>$NUTCH_HOME/conf/nutch-default.xml</code> for examples of how to do this.  You might also want to adjust <code>Protocol.CHECK_ROBOTS</code> ;)</p>
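<p>As a rough sketch, the two overrides might look like the fragment below.  The value <code>-1</code> is an assumption based on the property descriptions in <code>nutch-default.xml</code>, which say that a negative limit disables truncation; check the descriptions in your own copy, or use a large positive byte count instead.</p>

```xml
<!-- Goes inside the <configuration> element of nutch-site.xml. -->
<!-- Assumption: a negative limit means "do not truncate". -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
```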
<h2>Step 3: crawl</h2>
<p>I&#8217;m not going to go into this here, as it is well covered elsewhere.  Basically, you make a list of seed URLs, then let Nutch do its thing, e.g.:</p>

<div class="wp_syntax"><div class="code"><pre>$NUTCH_HOME/bin/nutch crawl /home/allenday/urls -dir /home/allenday/crawled -depth 5</pre></div></div>

<p>This is going to generate some directories under <code>/home/allenday/crawled/segments</code>.  </p>
<h2>Step 4: massage crawl outputs and extract images</h2>
<p>Merge the crawl segments into one big segment.  This makes the following steps easier.</p>

<div class="wp_syntax"><div class="code"><pre>$NUTCH_HOME/bin/nutch mergesegs /tmp/merged /home/allenday/crawled/segments</pre></div></div>

<p>Now dump the merged segment and list the URLs from the crawl.</p>

<div class="wp_syntax"><div class="code"><pre>$NUTCH_HOME/bin/nutch readseg -dump /tmp/merged/* /tmp/dump
$NUTCH_HOME/bin/hadoop dfs -cat /tmp/dump/dump | grep -aE 'URL'</pre></div></div>

<p>The grep should show something like this:</p>
<pre>
URL:: http://spicylogic.com/some-url.html
URL:: http://spicylogic.com/some-url.jpg
</pre>
<p>Obviously you&#8217;re interested in grepping for jpg, jpeg, etc.  Do it.</p>
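<p>If you would rather do the filtering in Java than with grep, here is a minimal sketch.  It assumes the dump lines look like the <code>URL:: ...</code> output above; the <code>ImageUrlGrep</code> class is a made-up helper, not something that ships with Nutch.</p>

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: keeps only the JPEG URLs from a readseg dump.
final class ImageUrlGrep {
  private static final String PREFIX = "URL:: ";

  // Returns the URLs from "URL:: ..." lines that end in .jpg or .jpeg (any case).
  static List<String> jpegUrls(List<String> dumpLines) {
    List<String> urls = new ArrayList<String>();
    for (String line : dumpLines) {
      if (!line.startsWith(PREFIX)) {
        continue; // not a URL line, e.g. raw content or other record fields
      }
      String url = line.substring(PREFIX.length()).trim();
      String lower = url.toLowerCase(Locale.ROOT);
      if (lower.endsWith(".jpg") || lower.endsWith(".jpeg")) {
        urls.add(url);
      }
    }
    return urls;
  }

  public static void main(String[] args) throws IOException {
    // ISO-8859-1 maps every byte to a character, so binary content in the dump cannot break decoding.
    BufferedReader in = new BufferedReader(
        new InputStreamReader(System.in, StandardCharsets.ISO_8859_1));
    List<String> lines = new ArrayList<String>();
    String line;
    while ((line = in.readLine()) != null) {
      lines.add(line);
    }
    for (String url : jpegUrls(lines)) {
      System.out.println(url);
    }
  }
}
```

<p>Pipe the output of the <code>hadoop dfs -cat</code> command above into it to get one image URL per line.</p>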
<p>Once you have the image list, you can use this little Java program to pull the images out of the segment one by one.</p>

<div class="wp_syntax"><div class="code"><pre class="java"><span style="color: #000000; font-weight: bold;">package</span> com.<span style="color: #006600;">spicylogic</span>.<span style="color: #006600;">allenday</span><span style="color: #66cc66;">;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">//JDK imports</span>
<span style="color: #a1a100;">import java.io.ByteArrayInputStream;</span>
<span style="color: #a1a100;">import java.io.DataInput;</span>
<span style="color: #a1a100;">import java.io.DataInputStream;</span>
<span style="color: #a1a100;">import java.io.DataOutput;</span>
<span style="color: #a1a100;">import java.io.DataOutputStream;</span>
<span style="color: #a1a100;">import java.io.IOException;</span>
<span style="color: #a1a100;">import java.util.Arrays;</span>
<span style="color: #a1a100;">import java.util.zip.InflaterInputStream;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">//Hadoop imports</span>
<span style="color: #a1a100;">import org.apache.hadoop.conf.Configuration;</span>
<span style="color: #a1a100;">import org.apache.hadoop.fs.FileSystem;</span>
<span style="color: #a1a100;">import org.apache.hadoop.fs.Path;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.ArrayFile;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.DataOutputBuffer;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.IntWritable;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.SequenceFile;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.SequenceFile.ValueBytes;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.Text;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.UTF8;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.VersionMismatchException;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.Writable;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">//Nutch imports</span>
<span style="color: #a1a100;">import org.apache.nutch.metadata.Metadata;</span>
<span style="color: #a1a100;">import org.apache.nutch.protocol.Content;</span>
<span style="color: #a1a100;">import org.apache.nutch.util.MimeUtil;</span>
<span style="color: #a1a100;">import org.apache.nutch.util.NutchConfiguration;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000000; font-weight: bold;">class</span> ExtractFile <span style="color: #66cc66;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #993333;">void</span> main<span style="color: #66cc66;">&#40;</span><span style="color: #aaaadd; font-weight: bold;">String</span> argv<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #aaaadd; font-weight: bold;">Exception</span> <span style="color: #66cc66;">&#123;</span>
&nbsp;
    <span style="color: #808080; font-style: italic;">// argv[0] is the URL to extract, argv[1] is the segment directory</span>
    <span style="color: #aaaadd; font-weight: bold;">String</span> usage = <span style="color: #ff0000;">&quot;ExtractFile url segment&quot;</span><span style="color: #66cc66;">;</span>
&nbsp;
    <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span>argv.<span style="color: #006600;">length</span> <span style="color: #66cc66;">&lt;</span> <span style="color: #cc66cc;">2</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
      <span style="color: #aaaadd; font-weight: bold;">System</span>.<span style="color: #006600;">out</span>.<span style="color: #006600;">println</span><span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">&quot;usage:&quot;</span> + usage<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
      <span style="color: #000000; font-weight: bold;">return</span><span style="color: #66cc66;">;</span>
    <span style="color: #66cc66;">&#125;</span>
    Configuration conf = NutchConfiguration.<span style="color: #006600;">create</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
    FileSystem fs = FileSystem.<span style="color: #006600;">parseArgs</span><span style="color: #66cc66;">&#40;</span>argv, <span style="color: #cc66cc;">0</span>, conf<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
    <span style="color: #000000; font-weight: bold;">try</span> <span style="color: #66cc66;">&#123;</span>
      <span style="color: #aaaadd; font-weight: bold;">String</span> segment = argv<span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">;</span>
&nbsp;
      Path file = <span style="color: #000000; font-weight: bold;">new</span> Path<span style="color: #66cc66;">&#40;</span>segment, Content.<span style="color: #006600;">DIR_NAME</span> + <span style="color: #ff0000;">&quot;/part-00000/data&quot;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
      SequenceFile.<span style="color: #aaaadd; font-weight: bold;">Reader</span> reader = <span style="color: #000000; font-weight: bold;">new</span> SequenceFile.<span style="color: #aaaadd; font-weight: bold;">Reader</span><span style="color: #66cc66;">&#40;</span>fs, file, conf<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
&nbsp;
      Text key = <span style="color: #000000; font-weight: bold;">new</span> Text<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
      Content content = <span style="color: #000000; font-weight: bold;">new</span> Content<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
&nbsp;
      <span style="color: #b1b100;">while</span> <span style="color: #66cc66;">&#40;</span>reader.<span style="color: #006600;">next</span><span style="color: #66cc66;">&#40;</span>key, content<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
        <span style="color: #808080; font-style: italic;">//System.err.println( key + &quot;\t=\t&quot; + argv[0] );</span>
        <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span>key.<span style="color: #006600;">equals</span><span style="color: #66cc66;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Text<span style="color: #66cc66;">&#40;</span>argv<span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
          <span style="color: #aaaadd; font-weight: bold;">System</span>.<span style="color: #006600;">out</span>.<span style="color: #006600;">write</span><span style="color: #66cc66;">&#40;</span> content.<span style="color: #006600;">getContent</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>, <span style="color: #cc66cc;">0</span>, content.<span style="color: #006600;">getContent</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>.<span style="color: #006600;">length</span> <span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
          <span style="color: #000000; font-weight: bold;">break</span><span style="color: #66cc66;">;</span>
        <span style="color: #66cc66;">&#125;</span>
      <span style="color: #66cc66;">&#125;</span>
      reader.<span style="color: #006600;">close</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
    <span style="color: #66cc66;">&#125;</span> <span style="color: #000000; font-weight: bold;">finally</span> <span style="color: #66cc66;">&#123;</span>
      fs.<span style="color: #006600;">close</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
    <span style="color: #66cc66;">&#125;</span>
  <span style="color: #66cc66;">&#125;</span>
<span style="color: #66cc66;">&#125;</span></pre></div></div>

<p>Compile it, then run it like so, redirecting standard output to a file:</p>

<div class="wp_syntax"><div class="code"><pre>$NUTCH_HOME/bin/hadoop --config YOUR:CLASS:PATH com.spicylogic.allenday.ExtractFile http://spicylogic.com/some-url.jpg /tmp/merged/* &gt; out.jpg</pre></div></div>
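<p>Since the content limits from Step 2 decide whether big files come through whole, it can be worth checking the extracted file.  Here is a minimal sketch that tests for the JPEG start-of-image (<code>FF D8</code>) and end-of-image (<code>FF D9</code>) markers.  The <code>JpegCheck</code> class is a made-up helper, and the check is only a heuristic: some valid JPEGs carry trailing bytes after the end marker.</p>

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: heuristic check that an extracted JPEG was not truncated.
final class JpegCheck {
  // True when the data starts with the SOI marker (FF D8) and ends with the EOI marker (FF D9).
  static boolean looksComplete(byte[] data) {
    int n = data.length;
    return n >= 4
        && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xD8
        && (data[n - 2] & 0xFF) == 0xFF && (data[n - 1] & 0xFF) == 0xD9;
  }

  public static void main(String[] args) throws IOException {
    // Defaults to out.jpg, the file name used in the command above.
    String path = args.length > 0 ? args[0] : "out.jpg";
    byte[] data = Files.readAllBytes(Paths.get(path));
    System.out.println(path + ": "
        + (looksComplete(data) ? "looks complete" : "possibly truncated or not a JPEG"));
  }
}
```

<p>If a file comes back as possibly truncated, the content limits from Step 2 are the first thing to recheck.</p>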

<p>Hope that helps.  Let me know if you have corrections/clarifications (or a complete script!) for this post and I&#8217;ll be happy to merge them in with attribution.</p>
<p>Thanks for this post go to <a href="http://kazmuzik.net/lj/77261.html">Kaz Muzik</a>, who was doing something similar to back up his blog.  Also, the <code>com.spicylogic.allenday.ExtractFile</code> class is based on the <code>org.apache.nutch.protocol.Content</code> class.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
