Saturday 17 January 2015

Setup Apache Hadoop in a Standalone Mode







Apache Hadoop is an open source framework for writing and running distributed applications that process large amounts of data.
Hadoop is a rapidly evolving ecosystem of components for implementing the Google MapReduce algorithms in a scalable fashion on commodity hardware.
Hadoop enables users to store and process large volumes of data and analyze it in ways not previously possible with less scalable solutions or standard SQL-based approaches.
In this tutorial, I will provide a step-by-step guide on how to configure Apache Hadoop in standalone mode.

Hadoop can be configured to run in the following modes:

Standalone Mode- In standalone mode, Hadoop runs as a single Java process on a single machine (e.g. an Ubuntu machine on the host VM), without starting any daemons. The configuration in standalone mode is quite straightforward and does not require major changes.
Pseudo-Distributed Mode- In a pseudo-distributed environment, all the Hadoop daemons (NameNode, DataNode, JobTracker, TaskTracker) run on a single machine in separate Java processes, so the one machine acts as both the master and the slave node.
Fully Distributed Mode- It is quite similar to a pseudo-distributed environment with the exception that the daemons run on a real cluster of machines/nodes, with one acting as master and the rest as slaves.



Installing & Configuring Hadoop in Standalone Mode
You might want to create a dedicated user for running Apache Hadoop, but it is not a prerequisite. In our setup, we will use the default user for running Hadoop.



Environment:
  • Ubuntu 10.10
  • JDK 6 or above
  • Hadoop-1.1.2 (Any stable release)




Follow these steps for installing and configuring Hadoop on a single node:

Step-1. Install Java
In this tutorial, we will use Java 1.6.

Use the command below to begin the installation of Java:
$ sudo apt-get install openjdk-6-jdk
or
$ sudo apt-get install sun-java6-jdk



This will install the full JDK under the /usr/lib/jvm directory (java-6-openjdk or java-6-sun, depending on the package chosen).



Step-2. Verify Java installation
You can verify the Java installation using the following command:
$ java -version

On executing this command, you should see output similar to the following:
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)



Step-3. Configure JAVA_HOME
Hadoop needs the Java installation path to work, so we will set the JAVA_HOME environment variable to point to our Java installation directory.
JAVA_HOME can be configured in the ~/.bash_profile or ~/.bashrc file. Alternatively, you can let Hadoop know about it by setting JAVA_HOME in Hadoop's conf/hadoop-env.sh file.

Use the command below to set JAVA_HOME on Ubuntu:
export JAVA_HOME=/usr/lib/jvm/java-6-sun



JAVA_HOME can be verified with the command:
echo $JAVA_HOME
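To make JAVA_HOME survive new shell sessions, the export line can be appended to ~/.bashrc as mentioned above. A minimal sketch, assuming the Sun JDK path from this tutorial (the duplicate-entry guard is just a convenience):

```shell
# Append JAVA_HOME to ~/.bashrc once, skipping the step if an entry already exists
RC="$HOME/.bashrc"
touch "$RC"
grep -q 'JAVA_HOME' "$RC" || echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> "$RC"
```

Open a new terminal (or run `source ~/.bashrc`) for the change to take effect.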



Step-4. SSH configuration
  • Install SSH using the command:
sudo apt-get install ssh
  • Generate an ssh key:
    ssh-keygen -t rsa -P "" (press Enter when asked for a file name; this generates a passphrase-less key pair)
  • Now copy the public key (id_rsa.pub) of the current machine to authorized_keys. The command below appends the generated public key to the .ssh/authorized_keys file:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
  • Verify the ssh configuration using the command:
ssh localhost
Answering yes will add localhost to the known hosts.
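For a scripted check that key-based login really works, BatchMode makes ssh fail instead of stopping at a password prompt. A small sketch (the status variable is only for illustration):

```shell
# BatchMode=yes makes ssh exit non-zero rather than prompt for a password
if ssh -o BatchMode=yes -o StrictHostKeyChecking=no localhost true 2>/dev/null; then
  ssh_status="ok"
else
  ssh_status="needs-setup"
fi
echo "passwordless ssh: $ssh_status"
```

If the status is not "ok", re-check the key generation and authorized_keys steps above.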



Step-5. Download Hadoop
Download a stable release of Apache Hadoop (this tutorial uses hadoop-1.1.2) from http://hadoop.apache.org/releases.html.
Unpack the release:
tar -zxvf hadoop-1.1.2.tar.gz
Move the extracted folder to an appropriate location; HADOOP_HOME will point to this directory.



Step-6. Configure HADOOP_HOME & Path environment
Use the following command to create an environment variable that points to the Hadoop installation directory (HADOOP_HOME):
export HADOOP_HOME=/home/user/hadoop



Now place the Hadoop binary directory on your command-line path by executing the command:
export PATH=$PATH:$HADOOP_HOME/bin



Use this command to verify your Hadoop installation:
hadoop version
The output should be similar to the one below:
Hadoop 1.1.2



Step-7. Create Data Directory for Hadoop
Hadoop needs only a limited number of directories to be set up correctly. Let us create a directory named hdfs with three sub-directories: name (for the NameNode), data (for the DataNode) and tmp.
  • mkdir ~/hdfs
  • mkdir ~/hdfs/tmp
  • mkdir ~/hdfs/name
  • mkdir ~/hdfs/data



Since the Hadoop user needs read-write access to these directories, change their permissions to 755 (or 777) for that user.
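The directory creation and permission change can also be combined into one short script (the base path under $HOME is an assumption; substitute your own location):

```shell
# Create the hdfs layout in one go and make it readable/writable for the Hadoop user
BASE="$HOME/hdfs"
mkdir -p "$BASE/name" "$BASE/data" "$BASE/tmp"
chmod 755 "$BASE" "$BASE/name" "$BASE/data" "$BASE/tmp"
```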



Step-8. Configure Hadoop XML files
Next, we will configure the Hadoop XML files. The Hadoop configuration files are in the HADOOP_HOME/conf directory.


conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ja/hdfs/tmp</value>
  </property>
</configuration>



conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/ja/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/ja/hdfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>




conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

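If you prefer to script the configuration, the three files can be generated with here-documents. A sketch, assuming HADOOP_HOME is set as in Step-6 (the fallback path is only for illustration):

```shell
# Write the three site files into the Hadoop conf directory
CONF="${HADOOP_HOME:-$HOME/hadoop}/conf"
mkdir -p "$CONF"

cat > "$CONF/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/home/ja/hdfs/tmp</value></property>
</configuration>
EOF

cat > "$CONF/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property><name>dfs.name.dir</name><value>/home/ja/hdfs/name</value></property>
  <property><name>dfs.data.dir</name><value>/home/ja/hdfs/data</value></property>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>
EOF

cat > "$CONF/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property><name>mapred.job.tracker</name><value>localhost:9001</value></property>
</configuration>
EOF
```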

Step-9. Format Hadoop Name Node:
Execute the command below from the Hadoop home directory:
$ ~/hadoop/bin/hadoop namenode -format



Step-10. Start Hadoop daemons
$ ~/hadoop/bin/start-all.sh



Step-11. Verify the daemons are running
$ jps
(jps ships with the JDK; use the full path, e.g. $JAVA_HOME/bin/jps, if it is not on your PATH)
The output will look similar to this:
9316 SecondaryNameNode
9203 DataNode
9521 TaskTracker
9403 JobTracker
9089 NameNode
Now we have all the daemons running.
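The jps check can be scripted so that a missing daemon is reported by name. A sketch (assumes jps is on PATH; the variable names are arbitrary):

```shell
# List any of the five Hadoop 1.x daemons that jps does not report
expected="NameNode DataNode SecondaryNameNode JobTracker TaskTracker"
running="$(jps 2>/dev/null || true)"
missing=""
for d in $expected; do
  echo "$running" | grep -q "$d" || missing="$missing $d"
done
if [ -z "$missing" ]; then
  echo "all daemons running"
else
  echo "not running:$missing"
fi
```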



Step-12. Verify Admin Page UI of Name Node & Job Tracker
Open a browser window and type the following URLs:
Name Node UI: http://localhost:50070
Job Tracker UI: http://localhost:50030



Now you have successfully installed and configured Hadoop on a single node.
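As a final smoke test, you can run one of the example jobs bundled with the release, e.g. the pi estimator. A sketch, assuming Hadoop 1.1.2 and its hadoop-examples jar sitting in HADOOP_HOME:

```shell
# Run the bundled pi example if the hadoop command is available
if command -v hadoop >/dev/null 2>&1; then
  hadoop jar "$HADOOP_HOME/hadoop-examples-1.1.2.jar" pi 2 10 && result="job ok" || result="job failed"
else
  result="hadoop not on PATH; revisit Step-6"
fi
echo "$result"
```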



Keep posting me your queries. I will try my best to share my opinion on them.
Till then, Happy Reading!!!