Overview
This documents my experience setting up hadoop on three 2-cpu hyperthreaded Centos 5 x86 boxes.
The machines being used are called:
- hadoop-0-0
- hadoop-0-1
- hadoop-0-2
System Setup
Unless otherwise stated, each task was executed on all three machines.
Set up hadoop user
I added a “hadoop” user up on each node, and untarred the hadoop source directly in to the /home/hadoop directory. So there was a /home/hadoop/conf, /home/hadoop/bin, etc.
useradd hadoop passwd hadoop cd /tmp wget http://apache.siamwebhosting.com/hadoop/core/hadoop-0.17.0/hadoop-0.17.0.tar.gz su hadoop cd ~ tar -xvzf /tmp/hadoop*gz mv hadoop*/* ./ rmdir hadoop* exit
Next, I set up passphrase-less ssh. On hadoop-0-0, I set up a DSA key:
su hadoop ssh-keygen -t dsa #[...use blank passphrase...] scp ~/.ssh/id_dsa.pub hadoop-0-1:/tmp/ scp ~/.ssh/id_dsa.pub hadoop-0-2:/tmp/
Then, on hadoop-0-1 and hadoop-0-2:
su hadoop mkdir ~/.ssh chmod 700 ~/.ssh cp /tmp/id_dsa.pub ~/.ssh/authorized_keys chmod 644 ~/.ssh/authorized_keys
This allows login from hadoop-0-0 to both hadoop-0-1 and hadoop-0-2 without typing a password. It’s necessary.
Install prerequisite software
ssh and rsync are required. You can install them like so:
yum -y install ssh rsync
I installed Java 6 from Sun (it’s important to mention here that the CentOS yum/JPackage RPMs for gcc-java, etc did not work for Mahout, so it had to be the Sun Java).
rpm -Uvh j*rpm
Setup system services
I turned off iptables and ip6tables, and disabled them on startup. You could configure them, but I just turned them off. They get in the way of the nodes communicating.
/etc/init.d/iptables stop /etc/init.d/ip6tables stop /usr/sbin/ntsysv #[...]
Then, I edited the hosts file. I actually did this later in the process after a bunch of debugging, but it makes sense to do it here. You need to make sure that the name by which you refer to a host is not associated with the loopback IP address (127.0.0.1). So your /etc/hosts file should look something like this on hadoop-0-0:
127.0.0.1 localhost localhost.localdomain ::1 localhost6.localdomain6 localhost6 10.0.0.1 hadoop-0-0
where the 10.0.0.1 entry is optional if you have some other way to resolve it, e.g. DNS. The key point is that the hadoop-0-0 name not be associated with 127.0.0.1.
Configuring hadoop in standalone mode
I tried using the Hadoop Quick Start guide first. It was trivial to get working in standalone mode. The documentation is sufficient, I won’t discuss it further.
Configuring hadoop in cluster mode
I tried using the Hadoop Cluster Setup guide next. It describes there are four types of hadoop services, split across three types of machines:
- NameNode machine, runs NameNode service
- JobTracker machine, runs JobTracker service
- Slave machine, runs TaskTracker and DataNode services
I wanted to have 2 slave nodes. So I decided to split it up like:
- hadoop-0-0 - NameNode and JobTracker
- hadoop-0-1 - DataNode and TaskTracker
- hadoop-0-2 - DataNode and TaskTracker
Here’s what my config files look like on all three machines:
/home/hadoop/conf/hadoop-env.sh: added one line.
export JAVA_HOME=/usr/java/jdk1.6.0_06
/home/hadoop/conf/masters:
hadoop-0-0
/home/hadoop/conf/slaves:
hadoop-0-1 hadoop-0-2
/home/hadoop/conf/hadoop-site.xml:
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>fs.default.name</name> <value>hdfs://hadoop-0-0:9000/</value> <final>true</final> </property> <property> <name>mapred.job.tracker</name> <value>hdfs://hadoop-0-0:9001/</value> <final>true</final> </property> <property> <name>dfs.data.dir</name> <value>/home/hadoop/data</value> <final>true</final> </property> <property> <name>mapred.system.dir</name> <value>/home/hadoop/mapred/system</value> <final>true</final> </property> <property> <name>mapred.local.dir</name> <value>/home/hadoop/mapred/local</value> <final>true</final> </property> <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>2</value> <final>true</final> </property> <property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>2</value> <final>true</final> </property> </configuration>
This defines some services on hadoop-0-0 ports 9000 and 9001, and says that the slaves should get 2 map and 2 reduce tasks each (one for each of my 4 CPUs [remember, I'm hyperthreading]).
Make sure the configuration files are the same on all machines.
Starting hadoop services
From here I pretty much followed the tutorial.
First, set up the filesystem for the NameNode service on the NameNode (hadoop-0-0 for me):
/home/hadoop/bin/hadoop namenode -format
Next start up HDFS, the distributed filesystem:
/home/hadoop/bin/start-dfs.sh
Then start up the JobTracker service on the JobTracker node (hadoop-0-0 for me):
/home/hadoop/bin/start-mapred.sh
This also starts up the TaskTracker and DataNode services on all the nodes specified in the /home/hadoop/conf/slaves file.
At this point, doing a:
ps x
as the hadoop user on the hadoop-0-{0,1,2} machines will show some java daemons running. If you run into trouble, there is diagnostic information in the /home/hadoop/logs/*.log files. These were indispensible in debugging my setup.
Testing hadoop setup
Testing HDFS
To test HDFS, the distributed file system. On hadoop-0-0:
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -ls / Found 1 items /home <dir> 2008-06-20 18:58 rwxr-xr-x hadoop supergroup [hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -touchz /foo [hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -ls / Found 2 items /foo <r 3> 0 2008-06-20 23:49 rw-r–r– hadoop supergroup /home <dir> 2008-06-20 18:58 rwxr-xr-x hadoop supergroup [hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -rm /foo Deleted /foo [hadoop@hadoop-0-0 ~]$
Testing MapReduce
Haven’t done this yet, except in standalone mode.

yodafide | 10-Jul-08 at 2:17 pm | Permalink
$ bin/hadoop jar hadoop-*-examples.jar grep input output ‘dfs[a-z.]+’
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 2: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 7: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 10: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 13: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 16: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 19: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 22: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 25: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 28: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 31: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 36: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 39: $’\r’: command not found
/cygdrive/c/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/bin/../co
nf/hadoop-env.sh: line 42: $’\r’: command not found
08/07/10 14:12:20 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName
=JobTracker, sessionId=
08/07/10 14:12:20 INFO mapred.FileInputFormat: Total input paths to process : 2
08/07/10 14:12:21 INFO mapred.JobClient: Running job: job_local_1
08/07/10 14:12:21 INFO mapred.MapTask: numReduceTasks: 1
08/07/10 14:12:21 INFO mapred.LocalJobRunner: file:/c:/Documents and Settings/ys
khoo/Desktop/hadoop/hadoop-0.16.4/input/hadoop-default.xml:0+34064
08/07/10 14:12:21 INFO mapred.TaskRunner: Task ‘job_local_1_map_0000′ done.
08/07/10 14:12:21 INFO mapred.TaskRunner: Saved output of task ‘job_local_1_map_
0000′ to file:/c:/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/gre
p-temp-56794453
08/07/10 14:12:21 INFO mapred.MapTask: numReduceTasks: 1
08/07/10 14:12:21 INFO mapred.LocalJobRunner: file:/c:/Documents and Settings/ys
khoo/Desktop/hadoop/hadoop-0.16.4/input/hadoop-site.xml:0+184
08/07/10 14:12:21 INFO mapred.TaskRunner: Task ‘job_local_1_map_0001′ done.
08/07/10 14:12:21 INFO mapred.TaskRunner: Saved output of task ‘job_local_1_map_
0001′ to file:/c:/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/gre
p-temp-56794453
08/07/10 14:12:21 INFO mapred.LocalJobRunner: reduce > reduce
08/07/10 14:12:21 INFO mapred.TaskRunner: Task ‘reduce_dtdqnd’ done.
08/07/10 14:12:21 INFO mapred.TaskRunner: Saved output of task ‘reduce_dtdqnd’ t
o file:/c:/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/grep-temp-
56794453
08/07/10 14:12:22 INFO mapred.JobClient: Job complete: job_local_1
08/07/10 14:12:22 INFO mapred.JobClient: Counters: 9
08/07/10 14:12:22 INFO mapred.JobClient: Map-Reduce Framework
08/07/10 14:12:22 INFO mapred.JobClient: Map input records=1124
08/07/10 14:12:22 INFO mapred.JobClient: Map output records=39
08/07/10 14:12:22 INFO mapred.JobClient: Map input bytes=34248
08/07/10 14:12:22 INFO mapred.JobClient: Map output bytes=1114
08/07/10 14:12:22 INFO mapred.JobClient: Combine input records=39
08/07/10 14:12:22 INFO mapred.JobClient: Combine output records=38
08/07/10 14:12:22 INFO mapred.JobClient: Reduce input groups=38
08/07/10 14:12:22 INFO mapred.JobClient: Reduce input records=38
08/07/10 14:12:22 INFO mapred.JobClient: Reduce output records=38
08/07/10 14:12:22 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with proces
sName=JobTracker, sessionId= - already initialized
08/07/10 14:12:22 INFO mapred.FileInputFormat: Total input paths to process : 1
08/07/10 14:12:22 INFO mapred.JobClient: Running job: job_local_2
08/07/10 14:12:22 INFO mapred.MapTask: numReduceTasks: 1
08/07/10 14:12:22 INFO mapred.LocalJobRunner: file:/c:/Documents and Settings/ys
khoo/Desktop/hadoop/hadoop-0.16.4/grep-temp-56794453/part-00000:0+1491
08/07/10 14:12:22 INFO mapred.TaskRunner: Task ‘job_local_2_map_0000′ done.
08/07/10 14:12:22 INFO mapred.TaskRunner: Saved output of task ‘job_local_2_map_
0000′ to file:/c:/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/out
put
08/07/10 14:12:22 INFO mapred.LocalJobRunner: reduce > reduce
08/07/10 14:12:22 INFO mapred.TaskRunner: Task ‘reduce_iy0cmt’ done.
08/07/10 14:12:22 INFO mapred.TaskRunner: Saved output of task ‘reduce_iy0cmt’ t
o file:/c:/Documents and Settings/yskhoo/Desktop/hadoop/hadoop-0.16.4/output
08/07/10 14:12:23 INFO mapred.JobClient: Job complete: job_local_2
08/07/10 14:12:23 INFO mapred.JobClient: Counters: 9
08/07/10 14:12:23 INFO mapred.JobClient: Map-Reduce Framework
08/07/10 14:12:23 INFO mapred.JobClient: Map input records=38
08/07/10 14:12:23 INFO mapred.JobClient: Map output records=38
08/07/10 14:12:23 INFO mapred.JobClient: Map input bytes=1405
08/07/10 14:12:23 INFO mapred.JobClient: Map output bytes=1101
08/07/10 14:12:23 INFO mapred.JobClient: Combine input records=0
08/07/10 14:12:23 INFO mapred.JobClient: Combine output records=0
08/07/10 14:12:23 INFO mapred.JobClient: Reduce input groups=2
08/07/10 14:12:23 INFO mapred.JobClient: Reduce input records=38
08/07/10 14:12:23 INFO mapred.JobClient: Reduce output records=38
I don’t quite get the errors at the very top. This is for the stand-alone operation of Hadoop. I did not find it as trivial, unfortunately. I’m setting this up on Windows btw.
yodafide | 10-Jul-08 at 2:40 pm | Permalink
$ ssh localhost
ssh: connect to host localhost port 22: Connection refused
Also, why do we need passwordless ssh access? And what should I do if the alternate solution that the Quick Start tutorial offers does not work?
allenday | 11-Jul-08 at 5:16 pm | Permalink
Hi yodafide,
It looks like these are windows-specific problems. My notes are only tested on Linux CentOS 5.
Just after a glance though, it looks like you’re having line terminator problems (the “\r” errors).
The reason the ssh is passwordless is that the hadoop admin scripts ssh to each of the storage/exec nodes, and you would have to type a password for each host if you didn’t configure the keys to be authorized.
yodafide | 07-Aug-08 at 4:28 pm | Permalink
Thanks allen. Say have you every tried putting your dfs on another drive when you were going through the pseudo-distibuted single-machine Hadoop tutorial?