Friday 30 December 2016

Hadoop Distributed File System

Hadoop Distributed File System:

         The Hadoop Distributed File System (HDFS) is a specially designed file system for storing very large amounts of data across a cluster of commodity hardware machines, with streaming data access. HDFS is the primary storage system used by big data applications.

       HDFS provides high-performance access to data across Hadoop clusters and is deployed on low-cost commodity hardware. Failures are common when we use low-cost commodity hardware, but HDFS is designed to be highly fault-tolerant: it stores each data block in 3 replicas by default (this can be changed). So if one node fails to provide a block during processing, the program does not stop; instead, the data is read from another node that holds a replica of that block. This decreases the risk of catastrophic failure, even in the event that numerous nodes fail. The number of copies kept for each block is called the replication factor.

          What are blocks? When HDFS takes input data, it breaks the data down into separate pieces called blocks and distributes copies of them to different nodes in the cluster, allowing for parallel processing. In Hadoop 1.2.x the default block size is 64MB; in newer versions it is 128MB. That means if we want to store a 200MB file in HDFS, it is divided into 4 blocks of 64MB + 64MB + 64MB + 8MB, which are stored on the nodes, with their replicas stored on other nodes. For example, if the replication factor is set to 3 (the default value in HDFS), there is one original block and two replicas.
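To see this in practice, HDFS ships an fsck tool that lists the blocks of a file and the nodes holding each replica. A minimal sketch, assuming a running Hadoop 1.x installation; the file path below is hypothetical:

# show how a stored file was split into blocks and where each replica lives
hadoop fsck /user/hduser/data/big-file.txt -files -blocks -locations
# the block size itself is controlled by the dfs.block.size property (in bytes)
# in hdfs-site.xml, e.g. 67108864 for 64MB in Hadoop 1.x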

      There are 5 daemons associated with a Hadoop 1.x cluster (HDFS and MapReduce). They are:

• JobTracker: -  Performs task assignment.
• TaskTracker: - Performs map/reduce tasks.
• NameNode: -  Maintains all file system metadata if the Hadoop Distributed File System (HDFS) is used; preferably (but not required to be) a separate physical server from JobTracker
• Secondary NameNode: -  Periodically check-points the file system metadata on the NameNode
• DataNode: -  Stores HDFS files and handles HDFS read/write requests; preferably co-located with TaskTracker for optimal data locality.

The NameNode, Secondary NameNode, and JobTracker run on the master node, while the DataNode and TaskTracker run on the slave nodes. These are back-end (daemon) processes. Communication between master and slave works as follows: the NameNode communicates only with the DataNodes, and the JobTracker communicates only with the TaskTrackers.
Under Replication and Over Replication:
 Default replication factor is 3.
Over-replicated blocks are blocks that exceed their target replication for the file they belong to. Normally, over-replication is not a problem: HDFS automatically deletes the excess replicas, which restores the balance in this case.
Under-replicated blocks are blocks that do not meet their target replication for the file they belong to. To balance this HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication.
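Both situations can be inspected and influenced from the command line. A hedged sketch, assuming a running Hadoop 1.x cluster; the file path is hypothetical:

# fsck reports the number of under-replicated and over-replicated blocks in the namespace
hadoop fsck /
# setrep changes the replication factor of an existing file; -w waits until the target is reached
hadoop fs -setrep -w 2 /user/hduser/data/big-file.txt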
NameNode and its Functionalities:
            The NameNode, i.e. the master, should be configured with high-end hardware; the slaves can be any commodity hardware. Each DataNode communicates with the NameNode through heartbeat signals, sent every 3 seconds by default. On every 10th heartbeat, the DataNode sends a block report to the NameNode, since the NameNode stores the metadata of the cluster. From these reports the NameNode keeps track of which blocks the cluster holds, the file directories, disk usage, and current activity.
            The NameNode assigns the first block to a DataNode based on network distance. The nearest DataNode within the cluster is the one that responds first to the NameNode's request.
Recovering from the loss of a DataNode is relatively easy compared to a NameNode outage. In Hadoop 1.x there is no automated recovery provision for a non-functional NameNode. The Hadoop NameNode is a notorious single point of failure (SPOF): loss of the NameNode halts the cluster and can result in data loss if corruption occurs and the data can't be recovered. In addition, restarting the NameNode can take hours for larger clusters (assuming the data is recoverable).
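The liveness information the NameNode derives from these heartbeats can be queried with dfsadmin. A small sketch, assuming a running cluster:

# prints overall capacity plus, per DataNode, its state, disk usage and last contact time
hadoop dfsadmin -report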
DataNode and its Functionalities:
A DataNode stores data in the HDFS. A functional filesystem has more than one DataNode, with data replicated across them. 
On start-up, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode, once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode talk directly to the DataNode to access the files. TaskTracker instances can, indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data. 
DataNode instances can talk to each other, which is what they do when they are replicating data.
·         There is usually no need to use RAID storage for DataNode data because data is designed to be replicated across multiple servers, rather than multiple disks on the same server.
·         An ideal configuration is for a server to have a DataNode, a TaskTracker, and enough physical disks, with one TaskTracker slot per CPU. This gives every TaskTracker slot 100% of a CPU, and separate disks to read and write data. 
·         Avoid using NFS for data storage in a production system. 
HDFS ARCHITECTURE:
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
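The namespace operations mentioned above map directly onto the HDFS shell. A brief sketch of creating, listing, and renaming paths (the directory and file names are hypothetical):

# create a directory in the namespace
hadoop fs -mkdir /user/hduser/demo
# copy a local file into HDFS (the NameNode picks the DataNodes, the client writes to them directly)
hadoop fs -put notes.txt /user/hduser/demo/
# list and rename entries in the namespace (both are pure metadata operations on the NameNode)
hadoop fs -ls /user/hduser/demo
hadoop fs -mv /user/hduser/demo/notes.txt /user/hduser/demo/notes-renamed.txt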

For further information regarding HDFS, see:

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

Thursday 29 December 2016

Installing Hadoop in our system

Please read about the Hadoop Distributed File System before going through this post:
http://thecompletehadoop.blogspot.in/2016/12/hadoop-distributed-file-system.html
In general, we can install Hadoop in 3 different modes:

Standalone Mode

1.    Default mode of Hadoop.
2.    HDFS is not utilized in this mode.
3.    Local file system is used for input and output.
4.    Used for debugging purpose.
5.    No custom configuration is required in the 3 Hadoop configuration files (mapred-site.xml, core-site.xml, hdfs-site.xml).
6.    Standalone mode is much faster than Pseudo-distributed mode.
7.    No Daemons and everything runs in a Single JVM (Java Virtual Machine).
8.    Suitable for running MapReduce programs during development (see the example below).
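As a quick illustration of standalone mode, here is a sketch based on the Apache single-node setup guide; it assumes the Hadoop 1.2.1 tarball has already been extracted to /usr/local/hadoop as described in the installation steps later in this post:

cd /usr/local/hadoop
mkdir input
cp conf/*.xml input
# run the bundled example job against the local file system, no daemons required
bin/hadoop jar hadoop-examples-1.2.1.jar grep input output 'dfs[a-z.]+'
cat output/*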

Pseudo Distributed Mode (Single Node Cluster)

1.    Configuration is required in the 3 files mentioned above for this mode.
2.    Replication factor is one for HDFS.
3.    Here one node runs as NameNode, DataNode, JobTracker, and TaskTracker.
4.    All daemons run on the same machine.
5.    Used to test real code against HDFS.
6.    A pseudo-distributed cluster is a cluster where all daemons run on one node, and each daemon has its own JVM.

Fully distributed mode (or multiple node cluster)

1.    This is the production phase.
2.    Data is distributed across many nodes.
3.    Different nodes are used as NameNode, DataNode, JobTracker, and TaskTracker.
4.    Daemons run on a cluster of machines.
5.    There is one host on which the NameNode runs, other hosts on which the DataNodes run, and further machines on which the TaskTrackers run. 

6.    In this distribution, separate masters and separate slaves are present.


Here in this blog we use Ubuntu as the primary OS to install Hadoop and run the commands.

How to Install Hadoop-1.2.1 (Single-Node Cluster) on Ubuntu-14.04, with JDK 8
In this DIY we will see how to set up a single-node Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu-14.04. The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.

This tutorial has been tested with the following software versions:
  • Ubuntu -14.04 (LTS)
  • Hadoop-1.2.1, released August 2013
  • JDK- 8 update 5


Step 1: Prerequisites

1. Download Hadoop
Hadoop-1.2.1 can be downloaded from here (Apache Hadoop site). Select a mirror, then select the 'hadoop-1.2.1/' directory and download hadoop-1.2.1.tar.gz. I assume you have downloaded it into your '/home/user_name/Downloads' directory.

2. Install JDK 8
Hadoop requires a working Java installation. So, open up a terminal (shortcut: Ctrl + Alt + T) and run the following:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update && sudo apt-get install oracle-java8-installer

It will take some time to download and install, so sit back and wait. Once it's done, we have to add JAVA_HOME to the Ubuntu environment. Run the following in a terminal to open the /etc/environment file.
sudo gedit /etc/environment

Now, append the following at the end of the file and save it:
JAVA_HOME="/usr/lib/jvm/java-8-oracle"
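The new JAVA_HOME value is normally picked up at the next login; to verify it in the current terminal (a quick sanity check, not strictly required):

source /etc/environment
echo $JAVA_HOME
java -version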

3. Adding a dedicated Hadoop system user
We will use a dedicated Hadoop user account for running Hadoop. While that’s not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine. The following commands are used for this purpose; image 1 shows a typical output screen.
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser


Image 1. Adding a Hadoop System User.

4. Installing and configuring SSH Server
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section. Install openssh server as:
sudo apt-get install openssh-server

Now, open a new terminal, switch to hduser, and generate an SSH key for the hduser user:
su - hduser
ssh-keygen -t rsa -P ""

The second command creates an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the pass-phrase every time Hadoop interacts with its nodes). The output will look like this:

Image 2. Creating an RSA Key Pair.

Now we have to enable SSH access to your local machine with this newly created key.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to our local machine with the hduser user. The step is also needed to save our local machine’s host key fingerprint to the hduser user's known_hosts file.
ssh localhost

The output will look like this:

Image 3. Testing the ssh Setup.

NOTE: If you get an error saying “ssh: connect to host localhost port 22: Connection refused”, then you have not installed the ‘openssh-server’ properly, install it first.
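If you are unsure whether the SSH daemon is actually running, the following should help on Ubuntu 14.04 (service names can differ between releases):

# check the status of the SSH daemon and start it if necessary
sudo service ssh status
sudo service ssh start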

5. Disable IPv6
Exit the ssh session and the hduser shell by typing "exit" until the terminal closes. Then open a new terminal and open the file /etc/sysctl.conf:
sudo gedit /etc/sysctl.conf

Now append the following lines at the end of the file and save it.
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You have to restart your system in order to make the changes take effect. You can check whether IPv6 is enabled on your machine with the following command:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A return value of 0 means IPv6 is enabled (image 4), a value of 1 means disabled (image 5, that’s what we want).

Image 4. IPv6 Enabled.

After restart

Image 5. IPv6 Disabled.
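As a side note, sysctl can usually reload /etc/sysctl.conf without waiting for a reboot, although the restart recommended above remains the safer option:

# re-read /etc/sysctl.conf and apply the settings to the running kernel
sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6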


Step 2: Install Hadoop

1. Extract Hadoop and modify the permissions
Extract the contents of the Hadoop package to a location of your choice, say ‘/usr/local/hadoop’. Make sure to change the owner of all the files to the hduser user and hadoop group. But first move the downloaded hadoop-1.2.1.tar.gz to ‘/usr/local/’ (image 6).
sudo mv /home/user_name/Downloads/hadoop-1.2.1.tar.gz /usr/local
cd /usr/local
sudo tar xzf hadoop-1.2.1.tar.gz
sudo mv hadoop-1.2.1 hadoop
sudo chown -R hduser:hadoop hadoop


Image 6. Extracting and Modifying the Permissions.

2. Update '$HOME/.bashrc' of hduser

Open the $HOME/.bashrc file of user hduser
sudo gedit /home/hduser/.bashrc

and add the following lines at the end:
# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_PREFIX/bin
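To make the new settings available in the current session and confirm that the hadoop binary is on the PATH (a quick check; the version banner should report 1.2.1):

# as hduser
source ~/.bashrc
hadoop version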


Step 3: Configuring Hadoop

Now we have to configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little cluster only contains our single local machine.

1. Setting up the working directory
We will use the directory ‘/app/hadoop/tmp’ in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point. Now we create the directory and set the required ownerships and permissions:
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp

If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.
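A quick way to double-check the ownership and permissions before moving on (the expected output shown in the comment is a sketch):

# should show something like: drwxr-x--- 2 hduser hadoop ... /app/hadoop/tmp
ls -ld /app/hadoop/tmp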

2. Configuring Hadoop setup files

I. hadoop-env.sh
The only required environment variable we have to configure for Hadoop is JAVA_HOME. Open conf/hadoop-env.sh (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Java 8 installation directory.
sudo gedit /usr/local/hadoop/conf/hadoop-env.sh

Replace

Image 7. Replace This.

With
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

II. core-site.xml
Open up the file /usr/local/hadoop/conf/core-site.xml
sudo gedit /usr/local/hadoop/conf/core-site.xml

and add the following snippet between <configuration>...</configuration> tags (see image 8). You can leave the settings below “as is” with the exception of the hadoop.tmp.dir parameter – this parameter you must change to a directory of your choice.
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>


Image 8. Add as Shown.

III. mapred-site.xml
Open up the file /usr/local/hadoop/conf/mapred-site.xml
sudo gedit /usr/local/hadoop/conf/mapred-site.xml

add the following snippet between <configuration>...</configuration> tags.
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

IV. hdfs-site.xml
Open up the file /usr/local/hadoop/conf/hdfs-site.xml
sudo gedit /usr/local/hadoop/conf/hdfs-site.xml

add the following snippet between <configuration>...</configuration> tags.
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>


Step 4: Formatting the HDFS Filesystem via the Namenode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.

NOTE: Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the commands in a new terminal
su - hduser
/usr/local/hadoop/bin/hadoop namenode -format

The output will look like this:

Image 9. Formatting Namenode.


Step 5: Starting your Single-Node Cluster

1. Run the command:
/usr/local/hadoop/bin/start-all.sh

This will start a NameNode, a Secondary NameNode, a DataNode, a JobTracker and a TaskTracker on your machine. The output will look like this:

Image 10. Starting the Single-Node Cluster.

2. Check whether the expected Hadoop processes are running:
cd /usr/local/hadoop
jps

The output will look like this (Process ids and ordering of processes may differ):

Image 11. Running Hadoop Processes.

If all six processes are listed (the five Hadoop daemons plus Jps itself), then your Hadoop installation is working fine.
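If one of the daemons is missing from the jps output, its log file usually explains why. A small sketch, assuming the installation path used in this tutorial (the exact log file name contains the user and host name):

ls /usr/local/hadoop/logs/
tail -n 50 /usr/local/hadoop/logs/hadoop-hduser-namenode-*.log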

3. You can also check if Hadoop is listening on the configured ports. Open a new terminal and run
sudo netstat -plten | grep java

Output will look like this:

Image 12. Hadoop is Listening.
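You can also probe the web interfaces: in Hadoop 1.x the NameNode UI listens on port 50070 and the JobTracker UI on port 50030 by default. A quick check with curl (install it with 'sudo apt-get install curl' if needed), or simply open the URLs in a browser:

# both should return HTTP status 200 when the daemons are up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50030/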


Step 6: Stopping your Single-Node Cluster

To stop all the daemons running on your machine, run the command:
/usr/local/hadoop/bin/stop-all.sh

The output will look like this:

Image 13. Stopping the Single-Node Cluster.

Congratulations! You have successfully installed Hadoop. You can start working with Hadoop now; just remember to start your cluster first. Happy Hadooping!