
Overview
This documents my experience setting up hadoop on three 2-cpu hyperthreaded Centos 5 x86 boxes.
The machines being used are called:
- hadoop-0-0
- hadoop-0-1
- hadoop-0-2
System Setup
Unless otherwise stated, each task was executed on all three machines.
Set up hadoop user
I added a “hadoop” user up on each node, and untarred the hadoop source directly in to the /home/hadoop directory. So there was a /home/hadoop/conf, /home/hadoop/bin, etc.
useradd hadoop
passwd hadoop
cd /tmp
wget http://apache.siamwebhosting.com/hadoop/core/hadoop-0.17.0/hadoop-0.17.0.tar.gz
su hadoop
cd ~
tar -xvzf /tmp/hadoop*gz
mv hadoop*/* ./
rmdir hadoop*
exit
Next, I set up passphrase-less ssh. On hadoop-0-0, I set up a DSA key:
su hadoop
ssh-keygen -t dsa
#[...use blank passphrase...]
scp ~/.ssh/id_dsa.pub hadoop-0-1:/tmp/
scp ~/.ssh/id_dsa.pub hadoop-0-2:/tmp/
Then, on hadoop-0-1 and hadoop-0-2:
su hadoop
mkdir ~/.ssh
chmod 700 ~/.ssh
cp /tmp/id_dsa.pub ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys
This allows login from hadoop-0-0 to both hadoop-0-1 and hadoop-0-2 without typing a password. It’s necessary.
Install prerequisite software
ssh and rsync are required. You can install them like so:
yum -y install ssh rsync
I installed Java 6 from Sun (it’s important to mention here that the CentOS yum/JPackage RPMs for gcc-java, etc did not work for Mahout, so it had to be the Sun Java).
rpm -Uvh j*rpm
Setup system services
I turned off iptables and ip6tables, and disabled them on startup. You could configure them, but I just turned them off. They get in the way of the nodes communicating.
/etc/init.d/iptables stop
/etc/init.d/ip6tables stop
/usr/sbin/ntsysv
#[...]
Then, I edited the hosts file. I actually did this later in the process after a bunch of debugging, but it makes sense to do it here. You need to make sure that the name by which you refer to a host is not associated with the loopback IP address (127.0.0.1). So your /etc/hosts file should look something like this on hadoop-0-0:
127.0.0.1 localhost localhost.localdomain
::1 localhost6.localdomain6 localhost6
10.0.0.1 hadoop-0-0
where the 10.0.0.1 entry is optional if you have some other way to resolve it, e.g. DNS. The key point is that the hadoop-0-0 name not be associated with 127.0.0.1.
Configuring hadoop in standalone mode
I tried using the Hadoop Quick Start guide first. It was trivial to get working in standalone mode. The documentation is sufficient, I won’t discuss it further.
Configuring hadoop in cluster mode
I tried using the Hadoop Cluster Setup guide next. It describes there are four types of hadoop services, split across three types of machines:
- NameNode machine, runs NameNode service
- JobTracker machine, runs JobTracker service
- Slave machine, runs TaskTracker and DataNode services
I wanted to have 2 slave nodes. So I decided to split it up like:
- hadoop-0-0 - NameNode and JobTracker
- hadoop-0-1 - DataNode and TaskTracker
- hadoop-0-2 - DataNode and TaskTracker
Here’s what my config files look like on all three machines:
/home/hadoop/conf/hadoop-env.sh: added one line.
export JAVA_HOME=/usr/java/jdk1.6.0_06
/home/hadoop/conf/masters:
hadoop-0-0
/home/hadoop/conf/slaves:
hadoop-0-1
hadoop-0-2
/home/hadoop/conf/hadoop-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-0-0:9000/</value>
<final>true</final>
</property>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://hadoop-0-0:9001/</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/data</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/hadoop/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>2</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
<final>true</final>
</property>
</configuration>
This defines some services on hadoop-0-0 ports 9000 and 9001, and says that the slaves should get 2 map and 2 reduce tasks each (one for each of my 4 CPUs [remember, I'm hyperthreading]).
Make sure the configuration files are the same on all machines.
Starting hadoop services
From here I pretty much followed the tutorial.
First, set up the filesystem for the NameNode service on the NameNode (hadoop-0-0 for me):
/home/hadoop/bin/hadoop namenode -format
Next start up HDFS, the distributed filesystem:
/home/hadoop/bin/start-dfs.sh
Then start up the JobTracker service on the JobTracker node (hadoop-0-0 for me):
/home/hadoop/bin/start-mapred.sh
This also starts up the TaskTracker and DataNode services on all the nodes specified in the /home/hadoop/conf/slaves file.
At this point, doing a:
ps x
as the hadoop user on the hadoop-0-{0,1,2} machines will show some java daemons running. If you run into trouble, there is diagnostic information in the /home/hadoop/logs/*.log files. These were indispensible in debugging my setup.
Testing hadoop setup
Testing HDFS
To test HDFS, the distributed file system. On hadoop-0-0:
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -ls /
Found 1 items
/home <dir> 2008-06-20 18:58 rwxr-xr-x hadoop supergroup
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -touchz /foo
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -ls /
Found 2 items
/foo <r 3> 0 2008-06-20 23:49 rw-r–r– hadoop supergroup
/home <dir> 2008-06-20 18:58 rwxr-xr-x hadoop supergroup
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -rm /foo
Deleted /foo
[hadoop@hadoop-0-0 ~]$
Testing MapReduce
Haven’t done this yet, except in standalone mode.