The following document describes the required steps for setting up a distributed multi-node Apache Hadoop cluster on two Ubuntu machines. The easiest way to build a multi-node cluster is to first install two individual single-node Hadoop clusters by following my previous tutorial "setting up hadoop single node cluster on Ubuntu" and then merge them with minimal configuration changes, so that one Ubuntu box becomes the designated master and the other box becomes a slave. We can add any number of additional slaves later as per our future requirements.
Please follow my previous blog post on "setting up hadoop single node cluster on Ubuntu".
1. Prerequisites
i. Networking
Networking plays an important role here. Before merging both single-node servers into a multi-node cluster, we need to make sure that the two nodes can ping each other (they must be connected to the same network/hub so that both machines can talk to each other). Once that is done, we move to the next step and select the master node and the slave node: here we pick 172.16.17.68 as the master machine (Hadoopmaster) and 172.16.17.61 as the slave (hadoopnode). Then we add both entries to the '/etc/hosts' file on each machine as follows.
sudo vi /etc/hosts
172.16.17.68 Hadoopmaster
172.16.17.61 hadoopnode
Note: Any additional slaves should also be added here on each machine, using unique hostnames (e.g.: 172.16.17.xx hadoopnode01, 172.16.17.xy hadoopnode02, and so on).
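For example, with one additional slave the /etc/hosts file on every machine might look like the following (the third IP address and hostname are only illustrative):
172.16.17.68 Hadoopmaster
172.16.17.61 hadoopnode
172.16.17.62 hadoopnode01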
ii. Enabling SSH:
The hduser account on the master (Hadoopmaster) machine needs to be able to connect to its own account on the master (Hadoopmaster) and also to the hduser account on the slave (hadoopnode) machine via password-less SSH login.
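If the hduser account does not already have an SSH key pair from the single-node setup, one can be generated on the master first; a quick sketch (the empty passphrase keeps the login password-less, and the second command authorizes hduser to SSH into the master itself):
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys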
hduser@Hadoopmaster:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoopnode
If you can log in to both machines without being prompted for a password when you run the following commands on the master and the slave, then we have configured it correctly.
ssh Hadoopmaster
ssh hadoopnode
2. Configurations:
The following are the files we need to edit to configure the multi-node Hadoop cluster correctly:
a. masters
b. slaves
c. core-site.xml
d. mapred-site.xml
e. hdfs-site.xml
Let's configure each config file accordingly:
a. masters:
On the master (Hadoopmaster) machine, we need to edit the masters file and add the master (Hadoopmaster) node's hostname as shown below.
vi masters
Hadoopmaster
b. slaves:
Lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will be running as shown:
Hadoopmaster
hadoopnode
If you have additional slave nodes, just add them to the conf/slaves file, one hostname per line.
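For instance, with the illustrative extra slave from the note above, the conf/slaves file on the master would contain:
Hadoopmaster
hadoopnode
hadoopnode01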
Configuring all *-site.xml files:
We need to use the same configuration on all nodes of the Hadoop cluster, i.e. the following *-site.xml files must be edited on each and every server accordingly.
c. core-site.xml:
We are changing the host name in the fs.default.name property from 'localhost' to Hadoopmaster; this property specifies the NameNode (the HDFS master) host and port.
vi core-site.xml
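A minimal sketch of the relevant property, assuming the NameNode port chosen in the single-node tutorial was 54310 (keep whatever port your existing core-site.xml already uses):
<property>
  <name>fs.default.name</name>
  <value>hdfs://Hadoopmaster:54310</value>
</property>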
d. hdfs-site.xml:
We are changing the replication factor to "2". The default value of dfs.replication is 3, but we have only two nodes available, so we set dfs.replication to 2.
vi hdfs-site.xml
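The property after the change looks like this:
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>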
e. mapred-site.xml:
We are changing the host name in the mapred.job.tracker property from 'localhost' to Hadoopmaster; this property specifies the JobTracker (the MapReduce master) host and port.
vi mapred-site.xml
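A sketch of the property, assuming the JobTracker port used in the single-node setup was 54311 (again, keep the port your existing mapred-site.xml already uses):
<property>
  <name>mapred.job.tracker</name>
  <value>Hadoopmaster:54311</value>
</property>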
3. Formatting and Starting/Stopping the HDFS filesystem via the NameNode:
The first step to starting up your multi-node Hadoop cluster is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your cluster nodes. To format the filesystem (which simply initializes the directory specified by the dfs.name.dir property), run the given command on the master (Hadoopmaster).
hadoop namenode -format
4. Starting the multi-node cluster:
Starting the cluster is performed in two steps.
We begin by starting the HDFS daemons: the NameNode daemon is started on Hadoopmaster, and DataNode daemons are started on all the slave nodes (those listed in conf/slaves).
Then we start the MapReduce daemons: the JobTracker is started on Hadoopmaster, and TaskTracker daemons are started on all the slave nodes (those listed in conf/slaves).
a. To start HDFS daemons:
start-dfs.sh
This brings up the NameNode on the master and the DataNodes on the machines listed in conf/slaves.
By running the jps command, we can see the list of Java processes running on the master and the slaves:
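On the master the listing should look roughly like the following (process IDs are only illustrative; the master also runs a DataNode because it is listed in conf/slaves), while the slave shows only DataNode and Jps:
14799 NameNode
14880 DataNode
15183 SecondaryNameNode
15268 Jps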
b. To start MapReduce daemons:
start-mapred.sh
This will bring up the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.
By running the jps command, we can see the list of Java processes, now including JobTracker and TaskTracker, running on the master and the slaves:
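On the master, JobTracker and TaskTracker now appear alongside the HDFS daemons (process IDs again illustrative); the slave shows DataNode and TaskTracker:
14799 NameNode
14880 DataNode
15183 SecondaryNameNode
15596 JobTracker
15897 TaskTracker
15931 Jps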
c. To stop MapReduce daemons:
stop-mapred.sh
d. To stop HDFS daemons:
stop-dfs.sh
5. Running a MapReduce Job:
Since we are now running on a cluster, use a much larger volume of input data than in the single-node setup.
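For example, assuming the input text files sit in a local directory such as /tmp/demo-books (a hypothetical path), copy them into HDFS first:
hadoop dfs -copyFromLocal /tmp/demo-books /user/hduser/demo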
hadoop jar hadoop*examples*.jar wordcount /user/hduser/demo /user/hduser/demo-output
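Once the job finishes, the results can be inspected directly from HDFS (the exact part-file name may vary):
hadoop dfs -ls /user/hduser/demo-output
hadoop dfs -cat /user/hduser/demo-output/part-r-00000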
We can observe the NameNode, JobTracker, and TaskTracker processes on the web interface by visiting the following URLs:
- http://Hadoopmaster:50070/ – web UI of the NameNode daemon
- http://Hadoopmaster:50030/ – web UI of the JobTracker daemon
- http://Hadoopmaster:50060/ – web UI of the TaskTracker daemon
*Hadoopmaster can be replaced with the machine's IP address.
With this, we are done setting up a multi-node Hadoop cluster. I hope this step-by-step guide helps you set up the same environment at your place.
Please leave a comment in the comments section with your doubts, questions, and suggestions; I will try to answer as soon as possible.