August 2008

Using Nutch to download large binary media and image files

Here’s a recipe for using Nutch to crawl a site (or several) and extract the images. I’m blogging this because I couldn’t find (no surprise here, sigh) any documentation or complete examples in the mailing list archives on how to do this.

Step 1: modify Nutch URL filters

First, edit $NUTCH_HOME/conf/crawl-urlfilter.txt. Assuming you only care about JPEG images, change this line:

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

to this:

-\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$

Also update the “MY.DOMAIN.NAME” section so that it accepts the domain(s) you are crawling.
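To sanity-check the edited rule before crawling, here is a minimal sketch that applies the same regular expression to a few URLs. It assumes the filter treats the rule as a "find anywhere in the URL" match; the class name UrlFilterCheck and the sample URLs are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: applies the edited exclusion rule to sample URLs.
class UrlFilterCheck {
    // The regex from the edited crawl-urlfilter.txt line, without the leading '-'.
    static final Pattern EXCLUDED = Pattern.compile(
        "\\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$");

    // True if the suffix rule would reject this URL.
    static boolean isExcluded(String url) {
        return EXCLUDED.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://spicylogic.com/some-url.html",
            "http://spicylogic.com/some-url.jpg",
            "http://spicylogic.com/some-url.JPEG",
            "http://spicylogic.com/some-url.png"
        };
        for (String url : samples) {
            System.out.println((isExcluded(url) ? "skip  " : "fetch ") + url);
        }
    }
}
```

With the jpg/JPG/jpeg/JPEG alternatives removed from the rule, JPEG URLs are no longer rejected by the suffix filter, while the other binary types still are.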

Step 2: set up crawl configuration

Edit $NUTCH_HOME/conf/nutch-site.xml. Add or update the http.content.limit and file.content.limit properties so that your big files don’t get truncated; a negative value disables truncation altogether. Look at $NUTCH_HOME/conf/nutch-default.xml for examples of the property format. You might also want to adjust Protocol.CHECK_ROBOTS ;)

Step 3: crawl

I’m not going to go into this here, as it is well covered elsewhere. Basically you make a list of seed URLs, then let Nutch do its thing, e.g.:

$NUTCH_HOME/bin/nutch crawl /home/allenday/urls -dir /home/allenday/crawled -depth 5

This is going to generate some directories under /home/allenday/crawled/segments.

Step 4: massage crawl outputs and extract images

Merge the crawl segments into one big segment. This makes the following steps easier.

$NUTCH_HOME/bin/nutch mergesegs /tmp/merged /home/allenday/crawled/segments

Now dump the segment and show the image URLs from the crawl.

$NUTCH_HOME/bin/nutch readseg -dump /tmp/merged/* /tmp/dump
$NUTCH_HOME/bin/hadoop dfs -cat /tmp/dump/dump | grep -aE 'URL'

The grep should show something like this:

URL:: http://spicylogic.com/some-url.html
URL:: http://spicylogic.com/some-url.jpg

Obviously you’re interested in grepping for jpg, jpeg, etc. Do it.
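If you would rather do that filtering in Java than with grep, here is a minimal sketch. It assumes the dump lines look exactly like the "URL:: ..." lines shown above and that image URLs end in .jpg or .jpeg with no query string; the class name JpegUrlFilter is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only the JPEG URLs from "URL:: ..." dump lines.
class JpegUrlFilter {
    // Matches "URL:: <url>" where the URL ends in .jpg or .jpeg (case-insensitive).
    private static final Pattern JPEG_URL_LINE =
        Pattern.compile("^URL::\\s+(\\S+\\.jpe?g)$", Pattern.CASE_INSENSITIVE);

    static List<String> jpegUrls(List<String> dumpLines) {
        List<String> urls = new ArrayList<>();
        for (String line : dumpLines) {
            Matcher m = JPEG_URL_LINE.matcher(line.trim());
            if (m.matches()) {
                urls.add(m.group(1));
            }
        }
        return urls;
    }

    // Reads dump lines from stdin and prints one JPEG URL per line.
    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            lines.add(line);
        }
        for (String url : jpegUrls(lines)) {
            System.out.println(url);
        }
    }
}
```

Pipe the output of the grep command above into it to get a plain list of URLs, one per line, ready to feed to the extraction step.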

Once you have the image list, you can use this little Java program to pull the images out of the segment one by one.

package com.spicylogic.allenday;

// Hadoop imports
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Nutch imports
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

/** Writes the raw fetched bytes for one URL in a Nutch segment to stdout. */
public final class ExtractFile {
  public static void main(String[] argv) throws Exception {

    String usage = "ExtractFile <url> <segment>";

    // Two arguments are required: the URL to extract and the segment directory.
    if (argv.length < 2) {
      // Print to stderr, since stdout is normally redirected to the image file.
      System.err.println("usage: " + usage);
      return;
    }
    Text url = new Text(argv[0]);
    String segment = argv[1];

    Configuration conf = NutchConfiguration.create();
    // Use whichever filesystem (local or DFS) the configuration names.
    FileSystem fs = FileSystem.get(conf);
    try {
      // Only the first part file is scanned, which is enough for a single merged segment.
      Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
      try {
        Text key = new Text();
        Content content = new Content();

        // Walk the (URL, content) records until the requested URL turns up.
        while (reader.next(key, content)) {
          if (key.equals(url)) {
            byte[] bytes = content.getContent();
            System.out.write(bytes, 0, bytes.length);
            System.out.flush();
            break;
          }
        }
      } finally {
        reader.close();
      }
    } finally {
      fs.close();
    }
  }
}

Compile it, put the class on Hadoop’s classpath (for example via HADOOP_CLASSPATH), then run it like so:

HADOOP_CLASSPATH=YOUR:CLASS:PATH $NUTCH_HOME/bin/hadoop com.spicylogic.allenday.ExtractFile http://spicylogic.com/some-url.jpg /tmp/merged/* > out.jpg

Hope that helps. Let me know if you have corrections/clarifications (or a complete script!) for this post and I’ll be happy to merge them in with attribution.

Thanks for this post go to Kaz Muzik, who was doing something similar to back up his blog. Also, the com.spicylogic.allenday.ExtractFile class is based on org.apache.nutch.protocol.Content class.

Java
Nutch

HBase bulk load/import example

Per my earlier post, I’ve almost finished an (actually compilable, functional) bulk loader example. Should be able to post it tomorrow afternoon to the HBase/MapReduce page, assuming I don’t get stuck in interviews/meetings all day.

HBase
Hadoop
Java

More thoughts on EC2 / EBS / Hadoop

I’m still getting up to speed on running Hadoop on EC2. Found this AWS post today describing how to easily port data from S3 into a Hadoop cluster, create new Hadoop slaves from the AMI system images, and start up and tear down clusters.

I made some comments yesterday about wanting to be able to scale the Hadoop cluster down as well as up, in particular being able to disable cores, which are the really expensive part of running a cluster on AWS.

Now we need to look into the AWS scripts and AMI images that are available to see how feasible it is to maintain more data volumes than images. What I’m (roughly) thinking is that we might set up M data volumes for the DFS, but run only N data/task nodes, where 1 <= N <= M. In the case that N < M, you just load some of the N nodes with more than one EBS volume.
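As a rough illustration of the N < M case, here is a minimal sketch that spreads M volumes over N nodes round-robin, so no node carries more than one volume beyond any other. This is just arithmetic, not anything the AWS or Hadoop scripts provide; the class name VolumeAssignment and the zero-based volume numbering are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: distribute M EBS volumes over N nodes, round-robin.
class VolumeAssignment {
    // Returns, for each node, the list of volume indices (0..volumes-1) attached to it.
    static List<List<Integer>> assign(int volumes, int nodes) {
        if (nodes < 1 || nodes > volumes) {
            throw new IllegalArgumentException("need 1 <= N <= M");
        }
        List<List<Integer>> byNode = new ArrayList<>();
        for (int n = 0; n < nodes; n++) {
            byNode.add(new ArrayList<>());
        }
        for (int v = 0; v < volumes; v++) {
            byNode.get(v % nodes).add(v);
        }
        return byNode;
    }

    public static void main(String[] args) {
        // Example: M = 5 volumes on N = 2 nodes.
        List<List<Integer>> byNode = assign(5, 2);
        for (int n = 0; n < byNode.size(); n++) {
            System.out.println("node " + n + " -> volumes " + byNode.get(n));
        }
    }
}
```

Scaling down then amounts to calling assign again with a smaller N and re-attaching volumes accordingly; whether HDFS copes gracefully with that is exactly the open question.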

Also need to look into how HDFS deals with adding new volumes, i.e. will it just start replicating data onto nodes as they are added into the system? Is there a way to hot-add rather than restarting the data master? Hot-adding EBS volumes onto existing data nodes?

Distributed Systems
Java
Scalability

Darren Rush blogging

Looks like Darren has started a blog. Awesome. Adding it to my link roll as soon as I figure out how.

I’m looking forward to that post comparing monitoring systems!

Random musings

EC2 + EBS + Hadoop at BiggerBoat

Rodger and I at BiggerBoat got Hadoop and HBase up and running on Amazon EC2 today. We initially set up a cluster of 1 master and 10 slaves. After a quick calculation of how much this costs to keep running 24/7, we started trying to figure out how to scale the thing DOWN as well as UP, and to be able to do so dynamically. Seems like the tricky piece is the Hadoop storage, not so much the compute power available. Amazon just launched their Elastic Block Store a few days ago, so we’re seeing how that fits in. Seems like the EBS I/O is pretty good given our Bonnie++ tests.

Tom White has some architecture scenarios for building this kind of stuff.

Distributed Systems
Java
Scalability

Examples for data import, export, and transport with HBase

I’m in the process of setting up an analytic workflow at BiggerBoat. It’s looking like the main theme in data structures around here will be the sparse matrix, so I’ve been playing with open-source technologies for sparse matrices. Apache Hadoop’s HBase is looking like a good choice for now, maybe Hive later.

Right now I’m getting familiar with the former. As part of this, I’m improving the docs on the wiki to make them more user- (as opposed to core developer-) friendly. My documentation goal right now is to add some data transformation example code. There are already lots of Hadoop examples for doing text -> text mapping, e.g. grep, cat, etc. For HBase, not so much. The cases I have in mind:

  • text to text (done, many examples)
  • flatfile to HBase table (Bulk loader in the HBase wiki, I haven’t tried it yet)
  • HBase table to flatfile
  • HBase table to HBase table

I’ll be adding updated, complete, and simple code for the latter two (three?) in the next few days to the HBase/MapReduce page.

Analytics
Distributed Systems
Java

Cookies, IP Addresses and Unique Users

I’ve been thinking about how to track unique users today. These are my so-far-unorganized thoughts. Please comment!

You can’t track users by cookie alone for a couple of reasons: 1) they might use multiple computers, 2) they might delete cookies, 3) multiple users might share the same computer (same account = same cookie).

You also can’t track users by IP address alone for some more reasons: 1) they might be using a mobile device or portable computer that moves from IP to IP, 2) there could be multiple machines passing through a single gateway IP (i.e. LAN NAT).

However, if you combine the cookie/IP information together, you can start to address some of these issues. Let’s assume you have some webserver logs that minimally contain <IP address "A">, <cookie ID "C">, <timestamp T> triples.

            === time ===>
PATTERN 1:
C1-A1 ---      ---
C1-A2    ---         ---
C1-A3       ---   ---

PATTERN 2:
C1-A1 ------
C2-A1       ------
C1-A2             ------

PATTERN 3:
C1-A1 -----!
C2-A1       -----!
C3-A1             ------

PATTERN 4:
C1-A1 ------      ------
C2-A1     ----------

PATTERN 5:
C1-A1 -----     -----
C2-A1      -----     ---

This matrix indicates the compatibility of each of the patterns (P1-P5)
with several different classes of cookie/IP address combination that we
might want to detect.

                                     patterns
                               P1  P2  P3  P4  P5
profiles                      --------------------
multiple users per IP        | -   +   -   +   +
multiple users per cookie    | -   -   -   -   -
multiple IPs per user        | +   +   -   -   -
multiple cookies per user    | -   -   +   -   +
cookie deletion              | -   -   +   -   -
"permanent" IP change        | -   +   -   -   -

Note that none of these patterns gives any indication for the “multiple
users per cookie” profile. To assess if there is more than one
user/cookie, you might want to look at the context in which you’re
observing the cookie. Consider attributes like (timezone corrected)
time-of-day, day-of-week, type of content being viewed.
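As a starting point for detecting these profiles, here is a minimal sketch that groups log triples two ways: the set of IPs seen per cookie and the set of cookies seen per IP. It deliberately ignores the timestamps, so it can flag "multiple IPs per cookie" or "multiple cookies per IP" but cannot yet tell the interleaved patterns from the sequential ones. The class and field names (CookieIpProfile, Hit) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: summarize <IP, cookie, timestamp> triples by cookie and by IP.
class CookieIpProfile {
    // One log record; timestamp is kept for later time-based analysis but unused here.
    static final class Hit {
        final String ip;
        final String cookie;
        final long timestamp;

        Hit(String ip, String cookie, long timestamp) {
            this.ip = ip;
            this.cookie = cookie;
            this.timestamp = timestamp;
        }
    }

    // Distinct IP addresses observed for each cookie ID.
    static Map<String, Set<String>> ipsByCookie(List<Hit> hits) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (Hit h : hits) {
            result.computeIfAbsent(h.cookie, k -> new TreeSet<>()).add(h.ip);
        }
        return result;
    }

    // Distinct cookie IDs observed for each IP address.
    static Map<String, Set<String>> cookiesByIp(List<Hit> hits) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (Hit h : hits) {
            result.computeIfAbsent(h.ip, k -> new TreeSet<>()).add(h.cookie);
        }
        return result;
    }

    public static void main(String[] args) {
        // The three segments of PATTERN 2: C1-A1, then C2-A1, then C1-A2.
        List<Hit> pattern2 = Arrays.asList(
            new Hit("A1", "C1", 1L),
            new Hit("A1", "C2", 2L),
            new Hit("A2", "C1", 3L));
        System.out.println("IPs per cookie: " + ipsByCookie(pattern2));
        System.out.println("cookies per IP: " + cookiesByIp(pattern2));
    }
}
```

For PATTERN 2 this reports two IPs for cookie C1 and two cookies for IP A1, which lines up with the "+" entries for that pattern in the matrix. Separating cookie deletion from a shared IP would need the timestamps as well.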

Analytics
Random musings

Hadoop / SGE Grid Engine Convergence

I’m an old hand with SGE and a more recent user of Hadoop / Pig. Good to see that there is interest in making these technologies interoperate.

Distributed Systems
Java
Scalability
Science
