
Hadoop multi-node cluster setup


The following document describes the steps required to set up a distributed multi-node Apache Hadoop cluster on two Ubuntu machines. The best way to install and set up a multi-node cluster is to start by installing two individual single-node Hadoop clusters, following my previous tutorial “setting up hadoop single node cluster on Ubuntu”, and then merge them with minimal configuration changes so that one Ubuntu box becomes the designated master and the other becomes a slave. We can add any number of additional slaves later as our requirements grow.

Please follow my previous blog post, “setting up hadoop single node cluster on Ubuntu”, before continuing.


1. Prerequisites

i. Networking

Networking plays an important role here. Before merging both single-node servers into a multi-node cluster, we need to make sure that the two nodes can ping each other (they need to be connected to the same network/hub so that both machines can talk to each other). Once that is done, we move on to selecting the master and slave nodes: here we are selecting 172.16.17.68 as the master machine (Hadoopmaster) and 172.16.17.61 as the slave (hadoopnode). Then we need to add both of them to the ‘/etc/hosts’ file on each machine as follows.

sudo vi /etc/hosts


172.16.17.68     Hadoopmaster
172.16.17.61     hadoopnode

Note: Any additional slaves should also be added here on each machine, using unique hostnames (e.g. 172.16.17.xx hadoopnode01, 172.16.17.xy hadoopnode02, and so on).
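
With /etc/hosts updated on both machines, it is worth verifying that each node can reach the other by name; a quick check with the hostnames chosen above looks like this:

hduser@Hadoopmaster:~$ ping -c 3 hadoopnode
hduser@hadoopnode:~$ ping -c 3 Hadoopmaster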


ii. Enabling SSH:

The hduser on the master (Hadoopmaster) machine needs to be able to connect to its own account on the master (Hadoopmaster) and also to connect as hduser to the slave (hadoopnode) machine via password-less SSH login.

hduser@Hadoopmaster:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoopnode
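
If the hduser key pair from the single-node setup does not exist yet on the master, it can be generated first; a minimal sketch, assuming an RSA key with an empty passphrase, and also authorizing the key for the master's own account:

hduser@Hadoopmaster:~$ ssh-keygen -t rsa -P ""
hduser@Hadoopmaster:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys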


If the following commands log in without prompting for a password when run on both the master and the slave, then SSH has been configured correctly:

ssh Hadoopmaster
ssh hadoopnode


2. Configurations:

The following are the files we need to edit for the configuration of the multi-node Hadoop cluster.

a. masters

b. slaves

c. core-site.xml

d. mapred-site.xml

e. hdfs-site.xml

Let's configure each of these files accordingly:

a. masters:

On the master (Hadoopmaster) machine we need to edit the masters file and add the master (Hadoopmaster) node name, as shown:

vi masters
Hadoopmaster


b. slaves:

This file lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will be running, as shown:

Hadoopmaster
hadoopnode


If you have additional slave nodes, just add them to the conf/slaves file, one hostname per line.

Configuring all *-site.xml files:

We need to use the same configuration on all the nodes of the Hadoop cluster, i.e. the *-site.xml files below must be edited identically on every server.

c. core-site.xml:

We are changing the host name from ‘localhost’ to Hadoopmaster, which specifies the NameNode (the HDFS master) host and port.

vi core-site.xml
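
A minimal sketch of the relevant property is shown below; the port 54310 is an assumption and should match whatever value the single-node setup used:

<property>
  <!-- fs.default.name points every node at the NameNode running on Hadoopmaster -->
  <name>fs.default.name</name>
  <value>hdfs://Hadoopmaster:54310</value>
</property>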


d. hdfs-site.xml:

We are changing the replication factor to “2”. The default value of dfs.replication is 3, but since we have only two nodes available, we set dfs.replication to 2.

vi hdfs-site.xml
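
A sketch of just the changed property:

<property>
  <!-- default block replication; set to 2 because the cluster has two DataNodes -->
  <name>dfs.replication</name>
  <value>2</value>
</property>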


e. mapred-site.xml:

We are changing the host name from ‘localhost’ to Hadoopmaster, which specifies the JobTracker (MapReduce master) host and port.

vi mapred-site.xml
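
A sketch of the changed property; as with core-site.xml, the port 54311 is an assumption and should match the single-node configuration:

<property>
  <!-- host and port of the JobTracker (the MapReduce master) -->
  <name>mapred.job.tracker</name>
  <value>Hadoopmaster:54311</value>
</property>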


3. Formatting and Starting/Stopping the HDFS filesystem via the NameNode:

The first step in starting up your multi-node Hadoop cluster is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your cluster. To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the following command on the NameNode (Hadoopmaster).

hadoop namenode -format


4. Starting the multi-node cluster:

Starting the cluster is performed in two steps.

We begin by starting the HDFS daemons: the NameNode daemon is started on Hadoopmaster, and DataNode daemons are started on all slave nodes.

Then we start the MapReduce daemons: the JobTracker is started on Hadoopmaster, and TaskTracker daemons are started on all slave nodes.

a. To start HDFS daemons:

start-dfs.sh

This brings up the NameNode on Hadoopmaster and the DataNodes on the machines listed in conf/slaves.


By running the jps command, we can see the list of Java processes running on the master and slaves.

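Since Hadoopmaster is also listed in conf/slaves here, it runs both master and slave HDFS daemons; roughly, the process names to expect are:

hduser@Hadoopmaster:~$ jps
# expected (process IDs will differ): NameNode, SecondaryNameNode, DataNode, Jps

hduser@hadoopnode:~$ jps
# expected: DataNode, Jps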

b. To start MapReduce daemons:

start-mapred.sh

This will bring up the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.


By running the jps command again, we can see the list of Java processes, now including the JobTracker and TaskTrackers, running on the master and slaves.

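Roughly, the process names to expect at this point are:

hduser@Hadoopmaster:~$ jps
# expected: NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Jps

hduser@hadoopnode:~$ jps
# expected: DataNode, TaskTracker, Jps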

c. To stop MapReduce daemons:

stop-mapred.sh


d. To stop HDFS daemons:

stop-dfs.sh


5. Running a MapReduce Job:

Use a much larger volume of input data than before, since we are now running on a cluster.
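
If the input data is not in HDFS yet, it can be copied in first; the local directory /tmp/gutenberg below is only a hypothetical example path:

hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/demo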

hadoop jar hadoop*examples*.jar wordcount /user/hduser/demo /user/hduser/demo-output

We can monitor the NameNode, JobTracker and TaskTracker processes on their web interfaces using the following URLs.
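
Assuming the default Hadoop 1.x web UI ports, these are:

http://Hadoopmaster:50070/   # NameNode (HDFS) web UI
http://Hadoopmaster:50030/   # JobTracker (MapReduce) web UI
http://hadoopnode:50060/     # TaskTracker web UI (one per slave)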

*Hadoopmaster can be replaced with the machine IP.

With this we are done setting up a multi-node Hadoop cluster. I hope this step-by-step guide helps you set up the same environment at your place.

Please leave a comment in the comment section with your doubts, questions and suggestions; I will try to answer asap :)

