First, see Jonas Widriksson's blog for a detailed walkthrough of running Hadoop 1.2.1 on the Raspberry Pi.
This article is an update, for version 2.6.0.
Create a hadoop group and an hduser user, and give hduser sudo rights:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
cd ~/
wget http://apache.mirrors.spacedump.net/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
sudo mkdir -p /opt
sudo tar -xvzf hadoop-2.6.0.tar.gz -C /opt/
cd /opt
sudo mv hadoop-2.6.0 hadoop
sudo chown -R hduser:hadoop hadoop
Configure the environment variables in .bashrc:
export JAVA_HOME=[the directory your JDK is installed in, i.e. the parent of the bin directory the java binary lives in]
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
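The three lines can be appended to hduser's .bashrc in one go. Note the JAVA_HOME value below is only an example (the Oracle JDK 8 path on Raspbian); adjust it to wherever your JDK actually lives:

```shell
# Append the Hadoop variables to hduser's ~/.bashrc so every new shell gets them.
# The JAVA_HOME path is an example (Oracle JDK 8 on Raspbian); adjust to your system.
cat >> ~/.bashrc <<'EOF'
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm-vfp-hflt
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
EOF
```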
Log back in as hduser and check that it works:
Prompt> hadoop version
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar
Configure the Hadoop environment variables.
hadoop-env.sh now lives in $HADOOP_INSTALL/etc/hadoop.
# The java implementation to use. Required.
export JAVA_HOME=[same as in .bashrc above]
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=250
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS -client"
The -client parameter is important: the default is -server, which is not supported by the JVMs available on the Raspberry Pi. Without it you get:
Error occurred during initialization of VM
Server VM is only supported on ARMv7+ VFP
In $HADOOP_INSTALL/etc/hadoop, edit the following configuration files.
core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/hdfs/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
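Note that Hadoop 2.x deprecates some of the 1.x property names used above; they still work, but log deprecation warnings at startup. If you prefer the current name, core-site.xml can use fs.defaultFS instead of fs.default.name:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:54310</value>
</property>
```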
sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
sudo chmod 750 /hdfs/tmp
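Mode 750 gives the owner (hduser) full access, the hadoop group read and traverse rights, and everyone else nothing. A generic sketch on a scratch directory (not the real /hdfs/tmp, which needs the sudo commands above) shows what that looks like:

```shell
# Demonstrate what chmod 750 produces, on a throwaway directory.
mkdir -p hdfs_tmp_demo
chmod 750 hdfs_tmp_demo
# %A prints the symbolic mode, %a the octal one: drwxr-x--- 750
stat -c '%A %a' hdfs_tmp_demo
```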
hdfs namenode -format
Create a boot script, $HADOOP_INSTALL/etc/bootstrap.sh, and make it executable (chmod +x):
#!/bin/bash
: ${HADOOP_PREFIX:=/opt/hadoop}
echo -- Setting up env
. $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
echo -- Cleaning...
rm -f /tmp/*.pid
# installing libraries if any - (resource urls added comma separated to the ACP system variable)
cd $HADOOP_PREFIX/share/hadoop/common ; for cp in ${ACP//,/ }; do echo == $cp; curl -LO $cp ; done; cd -
echo -- Starting HDFS
$HADOOP_PREFIX/sbin/start-dfs.sh 2> hadoop.err.log
echo -- Starting Yarn
$HADOOP_PREFIX/sbin/start-yarn.sh 2>> hadoop.err.log
if [[ $1 == "-d" ]]
then
while true
do
sleep 1000
done
fi
if [[ $1 == "-bash" ]]
then
/bin/bash
fi
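Two bash idioms in the script are worth spelling out: `: ${VAR:=default}` assigns a default only when the variable is unset or empty (the `:` is a no-op command that just triggers the expansion), and `${ACP//,/ }` replaces every comma with a space so the URL list can be iterated. A quick illustration:

```shell
# ${VAR:=default} assigns only when VAR is unset or empty.
unset HADOOP_PREFIX
: ${HADOOP_PREFIX:=/opt/hadoop}
echo "$HADOOP_PREFIX"             # /opt/hadoop

HADOOP_PREFIX=/usr/local/hadoop
: ${HADOOP_PREFIX:=/opt/hadoop}   # already set, so the default is ignored
echo "$HADOOP_PREFIX"             # /usr/local/hadoop

# ${VAR//,/ } turns a comma-separated list into a space-separated one.
ACP="http://host/a.jar,http://host/b.jar"
for cp in ${ACP//,/ }; do echo "$cp"; done
```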
Log in as hduser and run:
Prompt> $HADOOP_INSTALL/etc/bootstrap.sh
Run the jps (Java Process Status) command to check that all services started as expected:
Prompt> jps
2320 NameNode
2736 ResourceManager
2533 SecondaryNameNode
2407 DataNode
4328 Jps
2825 NodeManager
If you cannot see all of the processes above, review the log files in $HADOOP_INSTALL/logs to find the source of the problem.
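When something is missing (typically the DataNode), grepping the logs for ERROR and FATAL lines usually points at the culprit. The sketch below runs against a fabricated one-line sample log; on the Pi you would point it at $HADOOP_INSTALL/logs/*.log instead. The clusterID mismatch shown is a common cause when the namenode has been re-formatted:

```shell
# Fabricated sample log, standing in for $HADOOP_INSTALL/logs/*.log:
mkdir -p logs_demo
cat > logs_demo/hadoop-hduser-datanode-hadoop-node1.log <<'EOF'
2015-03-11 14:00:01 INFO  org.apache.hadoop.hdfs.server.datanode.DataNode: starting
2015-03-11 14:00:02 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed. Incompatible clusterIDs
EOF
# The actual search: show only ERROR/FATAL lines from every log file.
grep -hE 'ERROR|FATAL' logs_demo/*.log
```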
Create a test script, $HADOOP_INSTALL/etc/run.test:
#!/bin/bash
FILE_2_PROCESS=LICENSE.txt
if [ "$1" != "" ]
then
FILE_2_PROCESS=$1
fi
cd $HADOOP_INSTALL
rm -f hadoop.log
echo -- Listing HDFS Content --
hdfs dfs -ls -R / 2> hadoop.log
echo -- Removing HDFS Content --
hdfs dfs -rm /text.txt 2>> hadoop.log
hdfs dfs -rm -R /text.out.txt 2>> hadoop.log
echo -- Adding new content in HDFS: $FILE_2_PROCESS --
hdfs dfs -copyFromLocal ./$FILE_2_PROCESS /text.txt 2>> hadoop.log
echo -- Running MapReduce --
time hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar \
wordcount /text.txt /text.out.txt 2>> hadoop.log
echo -- MapReduce completed --
echo -- Extracting the result --
rm -rf ./text.out.txt
hdfs dfs -copyToLocal /text.out.txt ./ 2>> hadoop.log
echo -- Tailing the result file: --
tail ./text.out.txt/part*
echo -------------------
echo See also hadoop.log
The script uploads a sample file to HDFS, runs the wordcount example on it, and pulls the result back (feel free to grab any other text file you like better than LICENSE.txt; just pass its name as a parameter to the script).
-- Listing HDFS Content --
drwxr-xr-x - hduser supergroup 0 2015-03-11 14:13 /text.out.txt
drwxr-xr-x - hduser supergroup 0 2015-03-11 14:13 /text.out.txt/_temporary
drwxr-xr-x - hduser supergroup 0 2015-03-11 14:13 /text.out.txt/_temporary/0
-rw-r--r-- 1 hduser supergroup 1257294 2015-03-11 14:12 /text.txt
-- Removing HDFS Content --
Deleted /text.txt
Deleted /text.out.txt
-- Adding new content in HDFS: moby.dick.txt --
-- Running MapReduce --
real 1m38.132s
user 1m23.720s
sys 0m3.350s
-- MapReduce completed --
-- Extracting the result --
-- Tailing the result file: --
zig-zag 1
zodiac 1
zodiac, 2
zone 2
zone, 1
zone. 2
zoned 1
zones 2
zones. 1
zoology 1
-------------------
See also hadoop.log
real 5m28.078s
user 4m8.740s
sys 0m11.450s
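As the tail shows, wordcount splits on whitespace only, so "zone", "zone," and "zone." are counted as three different words. If you want them merged, a small post-processing pass over the part files does it; the sketch below runs on an inlined sample of the output above:

```shell
# Sample of the wordcount output (in reality you would read ./text.out.txt/part*).
cat > wc_sample.txt <<'EOF'
zodiac 1
zodiac, 2
zone 2
zone, 1
zone. 2
EOF
# Strip trailing punctuation from each word and sum the merged counts.
awk '{ gsub(/[[:punct:]]+$/, "", $1); sum[$1] += $2 }
     END { for (w in sum) print w, sum[w] }' wc_sample.txt | sort
```

With the sample above, "zone", "zone," and "zone." collapse into a single count of 5.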
The following tests were all done with the exact same micro SD card (Class 4). For the MapReduce task of the run above:
real 1m38.132s
user 1m23.720s
sys 0m3.350s
On a Raspberry Pi 2 B, 1 GB of RAM, for the MapReduce task only, I had:
real 0m29.795s
user 0m30.610s
sys 0m1.340s
On a Raspberry Pi model A (or A+), 256 MB of RAM, for the MapReduce task only, I had:
real 1m21.812s
user 0m50.940s
sys 0m4.930s
... interesting.