Notes on setting up hadoop on CentOS 5 x86

Overview

This documents my experience setting up hadoop on three 2-cpu hyperthreaded Centos 5 x86 boxes.

The machines being used are called:

  • hadoop-0-0
  • hadoop-0-1
  • hadoop-0-2

System Setup

Unless otherwise stated, each task was executed on all three machines.

Set up hadoop user

I added a “hadoop” user up on each node, and untarred the hadoop source directly in to the /home/hadoop directory.  So there was a /home/hadoop/conf, /home/hadoop/bin, etc.

useradd hadoop
passwd hadoop
cd /tmp
wget http://apache.siamwebhosting.com/hadoop/core/hadoop-0.17.0/hadoop-0.17.0.tar.gz
su hadoop
cd ~
tar -xvzf /tmp/hadoop*gz
mv hadoop*/* ./
rmdir hadoop*
exit

Next, I set up passphrase-less ssh. On hadoop-0-0, I set up a DSA key:

su hadoop
ssh-keygen -t dsa
#[...use blank passphrase...]
scp ~/.ssh/id_dsa.pub hadoop-0-1:/tmp/
scp ~/.ssh/id_dsa.pub hadoop-0-2:/tmp/

Then, on hadoop-0-1 and hadoop-0-2:

su hadoop
mkdir ~/.ssh
chmod 700 ~/.ssh
cp /tmp/id_dsa.pub ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys

This allows login from hadoop-0-0 to both hadoop-0-1 and hadoop-0-2 without typing a password. It’s necessary.

Install prerequisite software

ssh and rsync are required. You can install them like so:

yum -y install ssh rsync

I installed Java 6 from Sun (it’s important to mention here that the CentOS yum/JPackage RPMs for gcc-java, etc did not work for Mahout, so it had to be the Sun Java).

rpm -Uvh j*rpm

Setup system services

I turned off iptables and ip6tables, and disabled them on startup. You could configure them, but I just turned them off. They get in the way of the nodes communicating.

/etc/init.d/iptables stop
/etc/init.d/ip6tables stop
/usr/sbin/ntsysv
#[...]

Then, I edited the hosts file. I actually did this later in the process after a bunch of debugging, but it makes sense to do it here. You need to make sure that the name by which you refer to a host is not associated with the loopback IP address (127.0.0.1). So your /etc/hosts file should look something like this on hadoop-0-0:

127.0.0.1    localhost localhost.localdomain
::1    localhost6.localdomain6 localhost6
10.0.0.1    hadoop-0-0

where the 10.0.0.1 entry is optional if you have some other way to resolve it, e.g. DNS. The key point is that the hadoop-0-0 name not be associated with 127.0.0.1.

Configuring hadoop in standalone mode

I tried using the Hadoop Quick Start guide first. It was trivial to get working in standalone mode. The documentation is sufficient, I won’t discuss it further.

Configuring hadoop in cluster mode

I tried using the Hadoop Cluster Setup guide next. It describes there are four types of hadoop services, split across three types of machines:

  • NameNode machine, runs NameNode service
  • JobTracker machine, runs JobTracker service
  • Slave machine, runs TaskTracker and DataNode services

I wanted to have 2 slave nodes.  So I decided to split it up like:

  • hadoop-0-0 - NameNode and JobTracker
  • hadoop-0-1 - DataNode and TaskTracker
  • hadoop-0-2 - DataNode and TaskTracker

Here’s what my config files look like on all three machines:

/home/hadoop/conf/hadoop-env.sh: added one line.

export JAVA_HOME=/usr/java/jdk1.6.0_06

/home/hadoop/conf/masters:

hadoop-0-0

/home/hadoop/conf/slaves:

hadoop-0-1
hadoop-0-2

/home/hadoop/conf/hadoop-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-0-0:9000/</value>
<final>true</final>
</property>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://hadoop-0-0:9001/</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/data</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/hadoop/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>2</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
<final>true</final>
</property>
</configuration>

This defines some services on hadoop-0-0 ports 9000 and 9001, and says that the slaves should get 2 map and 2 reduce tasks each (one for each of my 4 CPUs [remember, I'm hyperthreading]).

Make sure the configuration files are the same on all machines.

Starting hadoop services

From here I pretty much followed the tutorial.

First, set up the filesystem for the NameNode service on the NameNode (hadoop-0-0 for me):

/home/hadoop/bin/hadoop namenode -format

Next start up HDFS, the distributed filesystem:

/home/hadoop/bin/start-dfs.sh

Then start up the JobTracker service on the JobTracker node (hadoop-0-0 for me):

/home/hadoop/bin/start-mapred.sh

This also starts up the TaskTracker and DataNode services on all the nodes specified in the /home/hadoop/conf/slaves file.

At this point, doing a:

ps x

as the hadoop user on the hadoop-0-{0,1,2} machines will show some java daemons running. If you run into trouble, there is diagnostic information in the /home/hadoop/logs/*.log files. These were indispensible in debugging my setup.

Testing hadoop setup

Testing HDFS

To test HDFS, the distributed file system. On hadoop-0-0:

[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -ls /
Found 1 items
/home   <dir>           2008-06-20 18:58        rwxr-xr-x       hadoop  supergroup
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -touchz /foo
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -ls /
Found 2 items
/foo    <r 3>   0       2008-06-20 23:49        rw-r–r–       hadoop  supergroup
/home   <dir>           2008-06-20 18:58        rwxr-xr-x       hadoop  supergroup
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -rm /foo
Deleted /foo
[hadoop@hadoop-0-0 ~]$

Testing MapReduce

Haven’t done this yet, except in standalone mode.