This page documents setting up a small cluster installation of hadoop and pig. I used a similar guide for Ubuntu that does hadoop setup from here and here, then later went on to configure pig (and eventually hbase).
All of this was done on Fedora Core 14 x86_64. As I upgrade to 15, I’ll update as necessary.
First things first, create a hadoop user and setup password-less ssh keys.
[hadoop@hostname ~]$ ssh-keygen -t rsa [hadoop@hostname ~]$ cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys [hadoop@hostname ~]$ chmod 600 ~/.ssh/authorized_keys
Java:
Download the latest rpm.bin Java from Oracle, which was 6u26 when I went through this. (You can use these directions or mine). I got the 64-bit RPM installer from Oracle.
# Make it executable and run it! chmod +x jdk-6u26-linux-x64-rpm.bin sudo sh jdk-6u26-linux-x64-rpm.bin
# Configure alternatives so java points to the right binary sudo /usr/sbin/alternatives --install /usr/bin/java java \ /usr/java/default/bin/java 20000
Get Hadoop and Pig:
Pick a mirror and get Hadoop 0.20.2 from here. Pick a mirror and get Hbase 0.90.3 from here. Finally, get Pig 0.8.1 from here. Untar the three packages into the hadoop user’s home directory and edit ~/.bashrc to set environment variables with the correct paths.
# .bashrc export HADOOP_HOME=/home/hadoop/hadoop-0.20.2 export HBASE_HOME=/home/hadoop/hbase-0.90.3 export PIG_HOME=/home/hadoop/pig-0.8.1 export HADOOPDIR=$HADOOP_HOME/conf # this tells hadoop to bind to IPv4 addresses instead of IPv6! export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true export JAVA_HOME=/usr/java/latest/ # It's handy to have the tools in your path too. export PATH=$PATH:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin
At this point, either log out and log back in or “source ~/.bashrc”.
Start configuring Hadoop:
Lets assume we have 3 computers to be part of the hadoop cluster. Pick one to be the master, and the other two to be slaves. We will use their hostnames for configuration, and for the sake of simplicity I’ll call them “my-master”, “slave1″, and “slave2″. Make sure the hostnames you use are the real hostnames that they use, not nicknames or IP addresses.
Verify you can ssh into each box without needing a password (i.e. You should be able to ‘ssh my-master’, ‘ssh slave1′, ‘ssh slave2′, all without typing a password).
Decide on a data directory for Hadoop. This can be anywhere, but it should be machine-local (not on NFS or samba shares!). I use /data/hadoop/tmp/, and hadoop will create the tmp folder if it has permissions to do so. Do the following:
sudo mkdir -p /data/hadoop sudo chown hadoop.hadoop /data/hadoop
edit $HADOOP_HOME/conf/core-site.xml:
<configuration> <property> <name>hadoop.tmp.dir</name> <value>/data/hadoop/tmp</value> </property> <property> <name>fs.default.name</name> <value>hdfs://my-master:54310</value> </property> </configuration>
edit $HADOOP_HOME/conf/hdfs-site.xml Set the replication value. The value for it really depends on your jobs, 3 is the default value and can be overridden on a per-file basis.
<configuration> <property> <name>dfs.replication</name> <value>3</value> </description> </property> </configuration>
edit $HADOOP_HOME/conf/mapred-site.xml
<configuration> <property> <name>mapred.job.tracker</name> <value>my-master:54311</value> </property> </configuration>
$HADOOP_HOME/conf/masters is simply a list of hostnames that are masters:
my-master
$HADOOP_HOME/conf/slaves is a list of hostnames that are slaves:
my-master slave1 slave2
At this point you can format the namenode, from the master run:
hadoop namenode -format
If that doesn’t work, or produces errors, you likely have permissions wrong on your hadoop temp directory. If the format command works, then start up the HDFS and tasktrackers.
$HADOOP_HOME/bin/start-all.sh
Other Hadoop Settings
There’s a few other things to configure once you have the basics running, like changing the number of map and reduce jobs per node, increasing memory for mappers, etc… These are job-dependent, so each install (and even each job) will need to tweak for better performance.
# hdfs-site.xml # increase to allow more connections to HDFS <property> <name>dfs.datanode.max.xcievers</name> <value>4096</value> </property>
# mapred-site.xml # set these values based on the number of cores and processing # that you plan to do. <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>8</value> </property> <property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>8</value> </property>
# hadoop-env.sh # The maximum amount of heap to use, in MB. Default is 1000. export HADOOP_HEAPSIZE=4000 # Extra Java runtime options. Empty by default. export HADOOP_OPTS="-server -XX:+UseParallelGC -XX:+UseAdaptiveSizePolicy"