First, see Jonas Widriksson's blog for a detailed walkthrough of running Hadoop 1.2.1 on the Raspberry Pi.
This article is an update, for version 2.6.0.
Create a hadoop group and an hduser user, and give hduser sudo rights:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
cd ~/
wget http://apache.mirrors.spacedump.net/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
sudo mkdir -p /opt
sudo tar -xvzf hadoop-2.6.0.tar.gz -C /opt/
cd /opt
sudo mv hadoop-2.6.0 hadoop
sudo chown -R hduser:hadoop hadoop
Configure the environment variables in .bashrc:
export JAVA_HOME=[the directory your JDK is installed in, i.e. the parent of the bin directory the java binary lives in]
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
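The three lines can be appended to hduser's .bashrc in one go. Note the JAVA_HOME value below is only an example (the Oracle JDK 8 path on Raspbian); adjust it to wherever your JDK actually lives:

```shell
# Append the Hadoop variables to hduser's ~/.bashrc so every new shell gets them.
# The JAVA_HOME path is an example (Oracle JDK 8 on Raspbian); adjust to your system.
cat >> ~/.bashrc <<'EOF'
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm-vfp-hflt
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
EOF
```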
Log back in as hduser and check that it works:
Prompt> hadoop version
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar
Configure the Hadoop environment variables.
hadoop-env.sh now lives in $HADOOP_INSTALL/etc/hadoop.
# The java implementation to use. Required.
export JAVA_HOME=[same as in .bashrc above]
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=250
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS -client"
The -client parameter is important: the default is -server, which is not supported by the JVMs available on the Raspberry Pi. Without it you get:
Error occurred during initialization of VM
Server VM is only supported on ARMv7+ VFP
In $HADOOP_INSTALL/etc/hadoop, edit the following configuration files.
core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/hdfs/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
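Note that Hadoop 2.x deprecates some of the 1.x property names used above; they still work, but log deprecation warnings at startup. If you prefer the current name, core-site.xml can use fs.defaultFS instead of fs.default.name:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:54310</value>
</property>
```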
sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
sudo chmod 750 /hdfs/tmp
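Mode 750 gives the owner (hduser) full access, the hadoop group read and traverse rights, and everyone else nothing. A generic sketch on a scratch directory (not the real /hdfs/tmp, which needs the sudo commands above) shows what that looks like:

```shell
# Demonstrate what chmod 750 produces, on a throwaway directory.
mkdir -p hdfs_tmp_demo
chmod 750 hdfs_tmp_demo
# %A prints the symbolic mode, %a the octal one: drwxr-x--- 750
stat -c '%A %a' hdfs_tmp_demo
```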
hdfs namenode -format
Create a boot script, $HADOOP_INSTALL/etc/bootstrap.sh, and make it executable (chmod +x):
#!/bin/bash
: ${HADOOP_PREFIX:=/opt/hadoop}
echo -- Setting up env
. $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
echo -- Cleaning...
rm -f /tmp/*.pid
# installing libraries if any - (resource urls added comma separated to the ACP system variable)
cd $HADOOP_PREFIX/share/hadoop/common ; for cp in ${ACP//,/ }; do echo == $cp; curl -LO $cp ; done; cd -
echo -- Starting HDFS
$HADOOP_PREFIX/sbin/start-dfs.sh 2> hadoop.err.log
echo -- Starting Yarn
$HADOOP_PREFIX/sbin/start-yarn.sh 2>> hadoop.err.log
if [[ $1 == "-d" ]]
then
while true
do
sleep 1000
done
fi
if [[ $1 == "-bash" ]]
then
/bin/bash
fi
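Two bash idioms in the script are worth spelling out: `: ${VAR:=default}` assigns a default only when the variable is unset or empty (the `:` is a no-op command that just triggers the expansion), and `${ACP//,/ }` replaces every comma with a space so the URL list can be iterated. A quick illustration:

```shell
# ${VAR:=default} assigns only when VAR is unset or empty.
unset HADOOP_PREFIX
: ${HADOOP_PREFIX:=/opt/hadoop}
echo "$HADOOP_PREFIX"             # /opt/hadoop

HADOOP_PREFIX=/usr/local/hadoop
: ${HADOOP_PREFIX:=/opt/hadoop}   # already set, so the default is ignored
echo "$HADOOP_PREFIX"             # /usr/local/hadoop

# ${VAR//,/ } turns a comma-separated list into a space-separated one.
ACP="http://host/a.jar,http://host/b.jar"
for cp in ${ACP//,/ }; do echo "$cp"; done
```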
Log in as hduser and run:
Prompt> $HADOOP_INSTALL/etc/bootstrap.sh
Run the jps (Java Process Status) command to check that all services started as expected:
Prompt> jps
2320 NameNode
2736 ResourceManager
2533 SecondaryNameNode
2407 DataNode
4328 Jps
2825 NodeManager
If you cannot see all of the processes above, review the log files in $HADOOP_INSTALL/logs to find the source of the problem.
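When something is missing (typically the DataNode), grepping the logs for ERROR and FATAL lines usually points at the culprit. The sketch below runs against a fabricated one-line sample log; on the Pi you would point it at $HADOOP_INSTALL/logs/*.log instead. The clusterID mismatch shown is a common cause when the namenode has been re-formatted:

```shell
# Fabricated sample log, standing in for $HADOOP_INSTALL/logs/*.log:
mkdir -p logs_demo
cat > logs_demo/hadoop-hduser-datanode-hadoop-node1.log <<'EOF'
2015-03-11 14:00:01 INFO  org.apache.hadoop.hdfs.server.datanode.DataNode: starting
2015-03-11 14:00:02 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed. Incompatible clusterIDs
EOF
# The actual search: show only ERROR/FATAL lines from every log file.
grep -hE 'ERROR|FATAL' logs_demo/*.log
```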
Create a test script, $HADOOP_INSTALL/etc/run.test:
#!/bin/bash
FILE_2_PROCESS=LICENSE.txt
if [ "$1" != "" ]
then
FILE_2_PROCESS=$1
fi
cd $HADOOP_INSTALL
rm -f hadoop.log
echo -- Listing HDFS Content --
hdfs dfs -ls -R / 2> hadoop.log
echo -- Removing HDFS Content --
hdfs dfs -rm /text.txt 2>> hadoop.log
hdfs dfs -rm -R /text.out.txt 2>> hadoop.log
echo -- Adding new content in HDFS: $FILE_2_PROCESS --
hdfs dfs -copyFromLocal ./$FILE_2_PROCESS /text.txt 2>> hadoop.log
echo -- Running MapReduce --
time hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar \
wordcount /text.txt /text.out.txt 2>> hadoop.log
echo -- MapReduce completed --
echo -- Extracting the result --
rm -rf ./text.out.txt
hdfs dfs -copyToLocal /text.out.txt ./ 2>> hadoop.log
echo -- Tailing the result file: --
tail ./text.out.txt/part*
echo -------------------
echo See also hadoop.log
The script uploads a sample file to HDFS, runs the wordcount example on it, and pulls the result back (feel free to grab any other text file you like better than LICENSE.txt; just pass its name as a parameter to the script).
-- Listing HDFS Content --
drwxr-xr-x - hduser supergroup 0 2015-03-11 14:13 /text.out.txt
drwxr-xr-x - hduser supergroup 0 2015-03-11 14:13 /text.out.txt/_temporary
drwxr-xr-x - hduser supergroup 0 2015-03-11 14:13 /text.out.txt/_temporary/0
-rw-r--r-- 1 hduser supergroup 1257294 2015-03-11 14:12 /text.txt
-- Removing HDFS Content --
Deleted /text.txt
Deleted /text.out.txt
-- Adding new content in HDFS: moby.dick.txt --
-- Running MapReduce --
real 1m38.132s
user 1m23.720s
sys 0m3.350s
-- MapReduce completed --
-- Extracting the result --
-- Tailing the result file: --
zig-zag 1
zodiac 1
zodiac, 2
zone 2
zone, 1
zone. 2
zoned 1
zones 2
zones. 1
zoology 1
-------------------
See also hadoop.log
real 5m28.078s
user 4m8.740s
sys 0m11.450s
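As the tail shows, wordcount splits on whitespace only, so "zone", "zone," and "zone." are counted as three different words. If you want them merged, a small post-processing pass over the part files does it; the sketch below runs on an inlined sample of the output above:

```shell
# Sample of the wordcount output (in reality you would read ./text.out.txt/part*).
cat > wc_sample.txt <<'EOF'
zodiac 1
zodiac, 2
zone 2
zone, 1
zone. 2
EOF
# Strip trailing punctuation from each word and sum the merged counts.
awk '{ gsub(/[[:punct:]]+$/, "", $1); sum[$1] += $2 }
     END { for (w in sum) print w, sum[w] }' wc_sample.txt | sort
```

With the sample above, "zone", "zone," and "zone." collapse into a single count of 5.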
The following tests were all done with the exact same micro SD card (Class 4). For the MapReduce task of the run above:
real 1m38.132s
user 1m23.720s
sys 0m3.350s
On a Raspberry Pi 2 B, 1 GB of RAM, for the MapReduce task only, I had:
real 0m29.795s
user 0m30.610s
sys 0m1.340s
On a Raspberry Pi model A (or A+), 256 MB of RAM, for the MapReduce task only, I had:
real 1m21.812s
user 0m50.940s
sys 0m4.930s
... interesting.