Hadoop on Fedora Core 14

This page documents setting up a small cluster installation of hadoop and pig. I used a similar guide for Ubuntu that does hadoop setup from here and here, then later went on to configure pig (and eventually hbase).

All of this was done on Fedora Core 14 x86_64. As I upgrade to 15, I’ll update as necessary.

First things first, create a hadoop user and setup password-less ssh keys.

[hadoop@hostname ~]$ ssh-keygen -t rsa
[hadoop@hostname ~]$ cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
[hadoop@hostname ~]$ chmod 600 ~/.ssh/authorized_keys

Java:

Download the latest rpm.bin Java from Oracle, which was 6u26 when I went through this. (You can use these directions or mine). I got the 64-bit RPM installer from Oracle.

# Make it executable and run it!
chmod +x jdk-6u26-linux-x64-rpm.bin
sudo sh jdk-6u26-linux-x64-rpm.bin
# Configure alternatives so java points to the right binary
sudo /usr/sbin/alternatives --install /usr/bin/java java \
/usr/java/default/bin/java 20000

Get Hadoop and Pig:

Pick a mirror and get Hadoop 0.20.2 from here. Pick a mirror and get Hbase 0.90.3 from here. Finally, get Pig 0.8.1 from here. Untar the three packages into the hadoop user’s home directory and edit  ~/.bashrc to set environment variables with the correct paths.

# .bashrc
export HADOOP_HOME=/home/hadoop/hadoop-0.20.2
export HBASE_HOME=/home/hadoop/hbase-0.90.3
export PIG_HOME=/home/hadoop/pig-0.8.1
export HADOOPDIR=$HADOOP_HOME/conf

# this tells hadoop to bind to IPv4 addresses instead of IPv6!
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

export JAVA_HOME=/usr/java/latest/

# It's handy to have the tools in your path too.
export PATH=$PATH:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin

At this point, either log out and log back in or “source ~/.bashrc”.

Start configuring Hadoop:

Lets assume we have 3 computers to be part of the hadoop cluster. Pick one to be the master, and the other two to be slaves. We will use their hostnames for configuration, and for the sake of simplicity I’ll call them “my-master”, “slave1″, and “slave2″. Make sure the hostnames you use are the real hostnames that they use, not nicknames or IP addresses.

Verify you can ssh into each box without needing a password (i.e. You should be able to ‘ssh my-master’, ‘ssh slave1′, ‘ssh slave2′, all without typing a password).

Decide on a data directory for Hadoop. This can be anywhere, but it should be machine-local (not on NFS or samba shares!). I use /data/hadoop/tmp/, and hadoop will create the tmp folder if it has permissions to do so. Do the following:

sudo mkdir -p /data/hadoop
sudo chown hadoop.hadoop /data/hadoop

edit $HADOOP_HOME/conf/core-site.xml:

<configuration>
<property>
 <name>hadoop.tmp.dir</name>
 <value>/data/hadoop/tmp</value>
</property>

<property>
 <name>fs.default.name</name>
 <value>hdfs://my-master:54310</value>
</property>
</configuration>

edit $HADOOP_HOME/conf/hdfs-site.xml Set the replication value. The value for it really depends on your jobs, 3 is the default value and can be overridden on a per-file basis.

<configuration>
<property>
 <name>dfs.replication</name>
 <value>3</value>
 </description>
</property>
</configuration>

edit $HADOOP_HOME/conf/mapred-site.xml

<configuration>
<property>
 <name>mapred.job.tracker</name>
 <value>my-master:54311</value>
</property>
</configuration>

$HADOOP_HOME/conf/masters is simply a list of hostnames that are masters:

my-master

$HADOOP_HOME/conf/slaves is a list of hostnames that are slaves:

my-master
slave1
slave2

At this point you can format the namenode, from the master run:

hadoop namenode -format

If that doesn’t work, or produces errors, you likely have permissions wrong on your hadoop temp directory. If the format command works, then start up the HDFS and tasktrackers.

$HADOOP_HOME/bin/start-all.sh

Other Hadoop Settings

There’s a few other things to configure once you have the basics running, like changing the number of map and reduce jobs per node, increasing memory for mappers, etc… These are job-dependent, so each install (and even each job) will need to tweak for better performance.

# hdfs-site.xml
# increase to allow more connections to HDFS
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
# mapred-site.xml
# set these values based on the number of cores and processing
# that you plan to do.
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
# hadoop-env.sh
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=4000

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="-server -XX:+UseParallelGC -XX:+UseAdaptiveSizePolicy"

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>