Java

More thoughts on EC2 / EBS / Hadoop

I’m still getting up to speed on running Hadoop on EC2. Found this AWS post today describing how to easily port data into a hadoop cluster from S3, as well as easily create new Hadoop slaves using the AMI system images, start up clusters, and tear down clusters.

I made some comments yesterday about wanting to be able to scale the Hadoop cluster down as well as up, in particular being able to disable cores, which are the really expensive part of running a cluster on AWS.

Now we need to look into the AWS scripts and AMI images that are available to see how feasible it is to just maintain more data volumes than images. What I’m (roughly) thinking is that we might set up M data volumes for the DFS, but might want to run 1 <= N <= data/task nodes. In the case that N < M, you just load some of the N nodes with more than 1 EBS volume.

Also need to look into how HDFS deals with adding new volumes, i.e. will it just start replicating data onto nodes as they are added into the system? Is there a way to hot-add rather than restarting the data master? Hot-adding EBS volumes onto existing data nodes?

Distributed Systems
Java
Scalability

Comments (0)

Permalink

EC2 + EBS + Hadoop at BiggerBoat

Rodger and I at BiggerBoat got Hadoop and HBase up and running on Amazon EC2 today. We initially set up a cluster of 1 master and 10 slaves. After a quick calculation of how much this costs to keep running 24/7, we started trying to figure out how to scale the thing DOWN as well as UP, and to be able to do so dynamically. Seems like the tricky piece is the Hadoop storage, not so much the compute power available. Amazon just launched their Elastic Block Store a few days ago, so we’re seeing how that fits in. Seems like the EBS I/O is pretty good given our Bonnie++ tests.

Tom White has some architecture scenarios for building this kind of stuff.

Distributed Systems
Java
Scalability

Comments (1)

Permalink

Examples for data import, export, and transport with HBase

I’m in the process of setting up an analytic workflow at BiggerBoat. It’s looking like the main theme in data structures around here will be the sparse matrix. So I’ve been playing with opensource technologies for sparse matrices. Apache Hadoop’s HBase is looking like a good choice for now, maybe Hive later.

Right now I’m getting familiar with the former. As part of this, I’m improving the docs on the wiki to make them more user- (as opposed to core developer-) friendly. My documentation goal right now is to add some data transformation example code. There are already lots of hadoop examples for doing text -&gt text mapping, e.g. grep, cat, etc. For HBase not so much. I.e.

  • text to text (done, many examples
  • flatfile to HBase table (Bulk loader in the HBase wiki, I haven’t tried it yet)
  • HBase table to flatfile
  • HBase table to HBase table

I’ll be adding updated, complete, and simple code for the latter two (three?) in the next few days to the HBase/MapReduce page.

Analytics
Distributed Systems
Java

Comments (0)

Permalink

Hadoop / SGE Grid Engine Convergence

I’m an old hand with SGE and a more user of Hadoop / Pig.  Good to see that there is interest in making these technologies interoperate.

Distributed Systems
Java
Scalability
Science

Comments (0)

Permalink

Notes on setting up Taste


Setting up Taste v1.7.2 on a CentOS 4 x86_64 box.

Taste has merged with Mahout now, but I still want to do this standalone b/c I’m having trouble getting the JUnit tests to pass for Mahout. With that out of the way…

These are the shell commands I assembled after following the Taste Demo guide.

#make sure you have ant, and the JDK.  I don't recommend the CentOS stock, get them from Sun/Apache
#download necessary .jar files, sources, data files.  unpack/move them to correct locations.
wget http://internap.dl.sourceforge.net/sourceforge/taste/taste-1.7.2.zip
wget http://internap.dl.sourceforge.net/sourceforge/proguard/proguard4.2.zip
wget http://www.grouplens.org/system/files/million-ml-data.tar__0.gz
wget http://www.hightechimpact.com/Apache/tomcat/tomcat-5/v5.5.26/bin/apache-tomcat-5.5.26.tar.gz
unzip taste-1.7.2.zip
unzip proguard4.2.zip
tar -xvzf million-ml-data.tar__0.gz
tar -xvzf apache-tomcat-5.5.26.tar.gz
cp proguard4.2/lib/proguard.jar lib/
mv [mr]*.dat src/example/com/planetj/taste/example/grouplens/
#start up tomcat on port 8080 (default)
JAVA_OPTS="-server -da -dsa -Xms1024m -Xmx1024m" JAVA_HOME=/usr/java/jdk1.6.0_02 sh apache-tomcat-5.5.26/bin/startup.sh
#build taste.war, and inject it into tomcat
JDK_HOME=/usr/java/jdk1.6.0_02 JAVA_HOME=/usr/java/jdk1.6.0_02 ant build-grouplens-example
cp taste.war apache-tomcat-5.5.26/webapps/
#test the app.  may take a minute or two on the first query.
wget -O - -S 'http://localhost:8080/taste/RecommenderServlet?userID=1&amp;debug=true'

Once you get that working, you can tweak the demo slightly to work on another data set. You just need to know the grouplens file format. ratings.dat is of the format:

UserID::MovieID::Rating::Timestamp

e.g.

1::1193::5::978300760

and movies.dat is of the format:

MovieID::Title::Genres

e.g.

1::Toy Story (1995)::Animation|Children's|Comedy

I wrote a script, let’s call it load_taste.pl, that can generate new movies.dat and ratings.dat files from an alternate data source. If I make these new files, I can drop them in place of the grouplens data, rebuild the .war files, and make recommendations on this other data set. Here’s how to do it:

#generate ratings.dat and movies.dat.  move them to replace the grouplens data files.
perl ./load_taste.pl
mv [mr]*.dat src/example/com/planetj/taste/example/grouplens/
#get rid of stale .war and .jar files
rm taste.war grouplens.jar
#build the "quick" version of the example.  see below for build.xml patch
JDK_HOME=/usr/java/jdk1.6.0_02 JAVA_HOME=/usr/java/jdk1.6.0_02 ant build-grouplens-example-quick
#inject the re-built .war file into tomcat.
cp taste.war apache-tomcat-5.5.26/webapps/
#get rid of stale tomcat caches
rm -rf apache-tomcat-5.5.26/webapps/taste apache-tomcat-5.5.26/temp/taste.*.txt

Note that I’ve defined a new ant build target called “build-grouplens-example-quick”. The purpose of this is that we only want to rebuild grouplens.jar and taste.war, not reoptimize/reverify/rebuild taste.jar, etc. The “build-grouplens-example” target takes ~55 seconds to complete on my machine, whereas the “build-grouplens-example-quick” target takes ~2 seconds. Here’s a diff to the original build.xml file:

--- /tmp/build.xml      2008-03-21 21:18:20.000000000 -0700
+++ ./build.xml 2008-06-30 11:46:18.000000000 -0700
@@ -161,6 +161,58 @@
      <delete file="${my-web.xml}"/>
   </target>
 
+  <target depends="" name="build-taste-server-quick" description="Builds deployable web-based Taste server">
+     <fail unless="my-recommender.jar" message="Please set -Dmy-recommender.jar=XXX"/>
+     <fail unless="my-recommender-class" message="Please set -Dmy-recommender-class=XXX"/>
+     <tempfile property="my-web.xml"/>
+     <copy file="src/main/com/planetj/taste/web/web.xml" tofile="${my-web.xml}">
+       <filterset>
+               <filter token="RECOMMENDER_CLASS" value="${my-recommender-class}"/>
+       </filterset>
+     </copy>
+     <war destfile="${release-war}" webxml="${my-web.xml}">
+       <lib dir=".">
+               <include name="${release-jar}"/>
+               <include name="${my-recommender.jar}"/>
+       </lib>
+       <lib dir="lib/axis"/>
+       <classes dir="build">
+               <include name="com/planetj/taste/web/**"/>
+       </classes>
+       <fileset dir="src/main/com/planetj/taste/web">
+               <include name="RecommenderService.jws"/>
+       </fileset>
+     </war>
+     <delete file="${my-web.xml}"/>
+  </target>
+  <target depends="" name="build-grouplens-example-quick" description="Builds deployable GroupLens example">
+     <javac source="1.5"
+            target="1.5"
+            deprecation="true"
+          debug="true"
+          optimize="false"
+            destdir="build"
+            srcdir="src/example">
+       <compilerarg value="-Xlint:all"/>
+       <classpath>
+               <pathelement location="${release-jar}"/>
+               <pathelement location="${annotations.jar}"/>
+       </classpath>
+     </javac>
+     <jar jarfile="grouplens.jar">
+       <fileset dir="src/example">
+               <include name="com/planetj/taste/example/grouplens/ratings.dat"/>
+               <include name="com/planetj/taste/example/grouplens/movies.dat"/>
+       </fileset>
+       <fileset dir="build">
+               <include name="com/planetj/taste/example/grouplens/**"/>
+       </fileset>
+     </jar>
+     <property name="my-recommender.jar" value="grouplens.jar"/>
+     <property name="my-recommender-class" value="com.planetj.taste.example.grouplens.GroupLensRecommender"/>
+     <antcall target="build-taste-server-quick"/>
+  </target>
+
   <target depends="build,optimize" name="build-grouplens-example" description="Builds deployable GroupLens example">
      <javac source="1.5"
             target="1.5"

Informatics
Java
Statistics

Comments (2)

Permalink

Notes on setting up hadoop on CentOS 5 x86

Overview

This documents my experience setting up hadoop on three 2-cpu hyperthreaded Centos 5 x86 boxes.

The machines being used are called:

  • hadoop-0-0
  • hadoop-0-1
  • hadoop-0-2

System Setup

Unless otherwise stated, each task was executed on all three machines.

Set up hadoop user

I added a “hadoop” user up on each node, and untarred the hadoop source directly in to the /home/hadoop directory.  So there was a /home/hadoop/conf, /home/hadoop/bin, etc.

useradd hadoop
passwd hadoop
cd /tmp
wget http://apache.siamwebhosting.com/hadoop/core/hadoop-0.17.0/hadoop-0.17.0.tar.gz
su hadoop
cd ~
tar -xvzf /tmp/hadoop*gz
mv hadoop*/* ./
rmdir hadoop*
exit

Next, I set up passphrase-less ssh. On hadoop-0-0, I set up a DSA key:

su hadoop
ssh-keygen -t dsa
#[...use blank passphrase...]
scp ~/.ssh/id_dsa.pub hadoop-0-1:/tmp/
scp ~/.ssh/id_dsa.pub hadoop-0-2:/tmp/

Then, on hadoop-0-1 and hadoop-0-2:

su hadoop
mkdir ~/.ssh
chmod 700 ~/.ssh
cp /tmp/id_dsa.pub ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys

This allows login from hadoop-0-0 to both hadoop-0-1 and hadoop-0-2 without typing a password. It’s necessary.

Install prerequisite software

ssh and rsync are required. You can install them like so:

yum -y install ssh rsync

I installed Java 6 from Sun (it’s important to mention here that the CentOS yum/JPackage RPMs for gcc-java, etc did not work for Mahout, so it had to be the Sun Java).

rpm -Uvh j*rpm

Setup system services

I turned off iptables and ip6tables, and disabled them on startup. You could configure them, but I just turned them off. They get in the way of the nodes communicating.

/etc/init.d/iptables stop
/etc/init.d/ip6tables stop
/usr/sbin/ntsysv
#[...]

Then, I edited the hosts file. I actually did this later in the process after a bunch of debugging, but it makes sense to do it here. You need to make sure that the name by which you refer to a host is not associated with the loopback IP address (127.0.0.1). So your /etc/hosts file should look something like this on hadoop-0-0:

127.0.0.1    localhost localhost.localdomain
::1    localhost6.localdomain6 localhost6
10.0.0.1    hadoop-0-0

where the 10.0.0.1 entry is optional if you have some other way to resolve it, e.g. DNS. The key point is that the hadoop-0-0 name not be associated with 127.0.0.1.

Configuring hadoop in standalone mode

I tried using the Hadoop Quick Start guide first. It was trivial to get working in standalone mode. The documentation is sufficient, I won’t discuss it further.

Configuring hadoop in cluster mode

I tried using the Hadoop Cluster Setup guide next. It describes there are four types of hadoop services, split across three types of machines:

  • NameNode machine, runs NameNode service
  • JobTracker machine, runs JobTracker service
  • Slave machine, runs TaskTracker and DataNode services

I wanted to have 2 slave nodes.  So I decided to split it up like:

  • hadoop-0-0 - NameNode and JobTracker
  • hadoop-0-1 - DataNode and TaskTracker
  • hadoop-0-2 - DataNode and TaskTracker

Here’s what my config files look like on all three machines:

/home/hadoop/conf/hadoop-env.sh: added one line.

export JAVA_HOME=/usr/java/jdk1.6.0_06

/home/hadoop/conf/masters:

hadoop-0-0

/home/hadoop/conf/slaves:

hadoop-0-1
hadoop-0-2

/home/hadoop/conf/hadoop-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-0-0:9000/</value>
<final>true</final>
</property>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://hadoop-0-0:9001/</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/data</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/hadoop/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>2</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
<final>true</final>
</property>
</configuration>

This defines some services on hadoop-0-0 ports 9000 and 9001, and says that the slaves should get 2 map and 2 reduce tasks each (one for each of my 4 CPUs [remember, I'm hyperthreading]).

Make sure the configuration files are the same on all machines.

Starting hadoop services

From here I pretty much followed the tutorial.

First, set up the filesystem for the NameNode service on the NameNode (hadoop-0-0 for me):

/home/hadoop/bin/hadoop namenode -format

Next start up HDFS, the distributed filesystem:

/home/hadoop/bin/start-dfs.sh

Then start up the JobTracker service on the JobTracker node (hadoop-0-0 for me):

/home/hadoop/bin/start-mapred.sh

This also starts up the TaskTracker and DataNode services on all the nodes specified in the /home/hadoop/conf/slaves file.

At this point, doing a:

ps x

as the hadoop user on the hadoop-0-{0,1,2} machines will show some java daemons running. If you run into trouble, there is diagnostic information in the /home/hadoop/logs/*.log files. These were indispensible in debugging my setup.

Testing hadoop setup

Testing HDFS

To test HDFS, the distributed file system. On hadoop-0-0:

[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -ls /
Found 1 items
/home   <dir>           2008-06-20 18:58        rwxr-xr-x       hadoop  supergroup
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -touchz /foo
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -ls /
Found 2 items
/foo    <r 3>   0       2008-06-20 23:49        rw-r–r–       hadoop  supergroup
/home   <dir>           2008-06-20 18:58        rwxr-xr-x       hadoop  supergroup
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -rm /foo
Deleted /foo
[hadoop@hadoop-0-0 ~]$

Testing MapReduce

Haven’t done this yet, except in standalone mode.

Computing
Java
Science

Comments (4)

Permalink