Using Nutch to download large binary media and image files
Here’s a recipe for using Nutch to crawl some site(s) and extract the images. I’m blogging this because I couldn’t find (no surprise here, sigh) any documentation or complete examples in the mailing list archives for how to do this.
Step 1: modify Nutch URL filters
Okay, so first thing, modify $NUTCH_HOME/conf/crawl-urlfilter.txt. Let’s assume you only care about JPEG images. Change this line:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
to this:
-\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$
Also update the “MY.DOMAIN.NAME” section so that it matches the site(s) you want to crawl.
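To sanity-check the edited rule before crawling, here is a minimal sketch that runs the new exclusion pattern against a few URLs. The class name UrlFilterCheck is hypothetical and not part of Nutch, and it assumes a rule applies whenever its pattern is found anywhere in the URL, so treat it as an approximation of what the filter does rather than the real thing.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: shows which URLs the edited
// exclusion rule from crawl-urlfilter.txt would still reject.
class UrlFilterCheck {
    // The edited rule from the post, without the leading '-' that marks it as an exclusion.
    static final Pattern EXCLUDED = Pattern.compile(
        "\\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$");

    // Assumption: a rule applies when its pattern is found anywhere in the URL.
    static boolean isExcluded(String url) {
        return EXCLUDED.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://spicylogic.com/some-url.html",
            "http://spicylogic.com/some-url.jpg",
            "http://spicylogic.com/some-url.gif"
        };
        for (String url : urls) {
            System.out.println((isExcluded(url) ? "skip   " : "fetch  ") + url);
        }
    }
}
```

With jpg, JPG, jpeg and JPEG removed from the list, the .jpg URL is no longer rejected by this rule, while .gif still is.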
Step 2: set up crawl configuration
Edit $NUTCH_HOME/conf/nutch-site.xml. You want to update/add properties for http.content.limit and file.content.limit so that your big files don’t get truncated. Look at $NUTCH_HOME/conf/nutch-default.xml for examples of how to do this. You might also want to adjust Protocol.CHECK_ROBOTS.
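As a rough sketch, the two overrides could look like the fragment below, placed inside the existing configuration element of nutch-site.xml. The value -1 is an assumption on my part: the property descriptions in nutch-default.xml say that a negative limit means no truncation, but check the descriptions in your own copy, or pick a byte count comfortably larger than your biggest file.

```xml
<!-- Sketch only: goes inside the <configuration> element of nutch-site.xml. -->
<property>
  <name>http.content.limit</name>
  <!-- Assumed value: negative means "do not truncate" -->
  <value>-1</value>
</property>
<property>
  <name>file.content.limit</name>
  <!-- Assumed value: negative means "do not truncate" -->
  <value>-1</value>
</property>
```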
Step 3: crawl
I’m not going to go into this here as it is well-covered elsewhere. Basically you just want to make a list of seed URLs, then let Nutch do its thing, e.g.:
$NUTCH_HOME/bin/nutch crawl /home/allenday/urls -dir /home/allenday/crawled -depth 5
This is going to generate some directories under /home/allenday/crawled/segments.
Step 4: massage crawl outputs and extract images
Merge the crawl segments into one big segment. This makes the following steps easier.
$NUTCH_HOME/bin/nutch mergesegs /tmp/merged /home/allenday/crawled/segments
Now dump the segment and show the image URLs from the crawl.
$NUTCH_HOME/bin/nutch readseg -dump /tmp/merged/* /tmp/dump
$NUTCH_HOME/bin/hadoop dfs -cat /tmp/dump/dump | grep -aE 'URL'
The grep should show something like this:
URL:: http://spicylogic.com/some-url.html
URL:: http://spicylogic.com/some-url.jpg
Obviously you’re interested in grepping for jpg, jpeg, etc. Do it.
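If you’d rather not fiddle with grep patterns, here is a minimal sketch in Java that does the same filtering. It assumes the dump contains lines of the form "URL:: <url>" as shown above. The class name JpegUrlLister is made up for this example, and it only looks at the end of the URL, so JPEG URLs with a query string would be missed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: keeps the JPEG URLs from the text dump of a segment.
class JpegUrlLister {
    // Line prefix as it appears in the grep output shown in the post.
    static final String PREFIX = "URL:: ";

    // Returns each JPEG URL once, in the order it first appears.
    static List<String> jpegUrls(List<String> dumpLines) {
        Set<String> urls = new LinkedHashSet<>();
        for (String line : dumpLines) {
            if (!line.startsWith(PREFIX)) {
                continue; // not a URL line (metadata, page content, ...)
            }
            String url = line.substring(PREFIX.length()).trim();
            String lower = url.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".jpg") || lower.endsWith(".jpeg")) {
                urls.add(url);
            }
        }
        return new ArrayList<>(urls);
    }

    public static void main(String[] args) throws IOException {
        // Read the dump from stdin; ISO-8859-1 maps every byte to a character,
        // so binary page content in the dump cannot cause decoding errors.
        List<String> lines = new ArrayList<>();
        BufferedReader in = new BufferedReader(
            new InputStreamReader(System.in, StandardCharsets.ISO_8859_1));
        String line;
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        for (String url : jpegUrls(lines)) {
            System.out.println(url);
        }
    }
}
```

Once compiled, it can sit at the end of the same pipe in place of grep, e.g. $NUTCH_HOME/bin/hadoop dfs -cat /tmp/dump/dump | java JpegUrlLister.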
Once you have the image list, you can use this little Java program to pull the images out of the segment one by one.
package com.spicylogic.allenday;

// Hadoop imports
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Nutch imports
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public final class ExtractFile {

  public static void main(String[] argv) throws Exception {
    String usage = "ExtractFile url segment";
    if (argv.length < 2) {
      // Print to stderr so a redirected stdout never ends up holding the usage text.
      System.err.println("usage: " + usage);
      return;
    }

    Configuration conf = NutchConfiguration.create();
    // Use whichever filesystem the configuration names (local or DFS).
    FileSystem fs = FileSystem.get(conf);
    try {
      Text wanted = new Text(argv[0]);
      String segment = argv[1];
      // Only the first part file of the merged segment is read.
      Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
      Text key = new Text();
      Content content = new Content();
      // Scan the (url, content) records until the requested URL turns up.
      while (reader.next(key, content)) {
        if (key.equals(wanted)) {
          byte[] data = content.getContent();
          // Write the raw bytes of the fetched file to stdout.
          System.out.write(data, 0, data.length);
          System.out.flush();
          break;
        }
      }
      reader.close();
    } finally {
      fs.close();
    }
  }
}
Compile it, then run it like so:
$NUTCH_HOME/bin/hadoop --config YOUR:CLASS:PATH com.spicylogic.allenday.ExtractFile http://spicylogic.com/some-url.jpg /tmp/merged/* > out.jpg
Hope that helps. Let me know if you have corrections/clarifications (or a complete script!) for this post and I’ll be happy to merge them in with attribution.
Thanks for this post go to Kaz Muzik, who was doing something similar to back up his blog. Also, the com.spicylogic.allenday.ExtractFile class is based on the org.apache.nutch.protocol.Content class.