Setting Up a Hadoop Cluster with Ansible Playbook

Rahul Bhardwaj
6 min read · Mar 27, 2021

Prerequisites

  • Basic knowledge of Hadoop Cluster
  • Ansible-playbook

Introduction

In this article we will set up a Hadoop cluster with the help of an Ansible playbook on Red Hat Linux, taking as many slave nodes and client nodes in the cluster as you like.

I have tried my best to highlight the confusing points you may face while creating the playbook.

First, let’s briefly discuss what a Hadoop cluster and an Ansible playbook are.

Hadoop Cluster:

A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets.

Ansible-Playbook:

An Ansible playbook is a YAML file in which you write down the steps to configure the nodes. So, instead of running ad-hoc commands one by one, you can jot everything down in a single file and run it in one go.

The Inventory

The playbook will be divided into four parts.

The first will do the common configuration on all the nodes irrespective of their role, the second will do the name-node configuration, the third the data-node configuration, and the fourth the client configuration.

So, to make this work, we have to create groups in our inventory in the following manner:
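An inventory along these lines would work; the node names and IP addresses below are placeholders, so adjust them to your own machines:

```ini
[namenode]
node1 ansible_host=192.168.56.101

[datanode]
node2 ansible_host=192.168.56.102
node3 ansible_host=192.168.56.103

[client]
node4 ansible_host=192.168.56.104

[hadoop:children]
namenode
datanode
client
```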

The hadoop:children group includes all the nodes, so any play targeting hadoop will do its configuration on every node.

Starting With The Playbook

The first part will be the common configuration on all the nodes. This includes the tasks of installing the Java software (JDK) and the Hadoop software. Different versions of Java and Hadoop may cause conflicts, so I’m using the pre-tested packages that I already downloaded.

If you need to run the playbook again, the task installing the JDK will throw an error because it is already installed, so I used ignore_errors to take care of that.
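A sketch of this common play, assuming the RPMs were already downloaded to the controller (the exact file names below are illustrative):

```yaml
- hosts: hadoop
  tasks:
    - name: Copy the pre-downloaded JDK and Hadoop RPMs to every node
      copy:
        src: "{{ item }}"
        dest: /root/
      loop:
        - jdk-8u171-linux-x64.rpm
        - hadoop-1.2.1-1.x86_64.rpm

    - name: Install the JDK (fails harmlessly on re-runs, hence ignore_errors)
      command: rpm -i /root/jdk-8u171-linux-x64.rpm
      ignore_errors: yes

    - name: Install Hadoop on top of the freshly installed JDK
      command: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force
      ignore_errors: yes
```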

Name-Node

For the Hadoop configuration we need two XML files in the /var/hadoop folder, namely core-site.xml and hdfs-site.xml.

core-site.xml
hdfs-site.xml
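For a Hadoop 1.x name-node the two files might look roughly like this (the port number is a typical choice, and the name-node IP placeholder is made dynamic in the next step):

```xml
<!-- core-site.xml: tells every daemon where the name-node lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://NAMENODE_IP:9001</value>
  </property>
</configuration>
```

```xml
<!-- hdfs-site.xml on the name-node: where the metadata directory lives -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
```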

As you can see, the core-site.xml file requires the IP address of the name-node. If you use this playbook to create a fresh cluster, you would otherwise have to look the IP address up again. So, to make it dynamic, we’ll take the help of ansible_facts, which picks up the node’s IPv4 address from the facts Ansible gathers.

So our playbook will look something like this:
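A sketch of the name-node play, with paths taken from this article and the local file names being assumptions:

```yaml
- hosts: namenode
  tasks:
    - name: Create the name-node metadata directory
      file:
        path: /nn
        state: directory
        mode: "0755"

    - name: Render core-site.xml with this node's own IPv4 address
      template:
        src: core-site.xml.j2   # value uses {{ ansible_facts['default_ipv4']['address'] }}
        dest: /var/hadoop/core-site.xml

    - name: Copy hdfs-site.xml pointing dfs.name.dir at /nn
      copy:
        src: hdfs-site-nn.xml
        dest: /var/hadoop/hdfs-site.xml

    - name: Fetch core-site.xml back to the controller for the later plays
      fetch:
        src: /var/hadoop/core-site.xml
        dest: fetched/core-site.xml
        flat: yes
```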

We also created the /nn (namenode) directory for our Hadoop cluster, using the 0755 mode to make it readable and writable. We’re also fetching the core-site.xml file, which will come in handy in an upcoming section, so hang on for that (make sure to use the flat argument in fetch to avoid dealing with the remote directory structure).

Furthermore, there is one tricky part where you may have a hard time troubleshooting. As you know, before starting the name-node you have to format it. But if you’re running the playbook again, for example to add a new node, running the format command in a straightforward manner won’t work.

What happens the second time is that it asks whether you want to re-format the file system, which would erase all the data previously stored on the cluster. So, to solve this problem, we’ll make use of Ansible’s expect module.

To use this module we’ll need the pexpect Python module available on the target node.

Now, to prompt the user whether to re-format it or not, we have to make use of vars_prompt.

Note: Make sure to answer that prompt with a capital Y or N.
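The prompt can be declared at the top of the name-node play, for example:

```yaml
- hosts: namenode
  vars_prompt:
    - name: response
      prompt: "Re-format the filesystem? Answer with a capital Y or N"
      private: no   # show the answer as it is typed
```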

After prompting, we’ll use the expect module to run the format command.

Here, we took the help of a regular expression, Re-format.* . This will match any line starting with Re-format followed by any characters.

The "{{ response }}" part takes the answer you entered at the vars_prompt, using Jinja2 templating.

Note: If you use the complete sentence instead of Re-format.* you might get an error, as the prompt contains space characters that are difficult to match exactly, so it is better to rely on the regex.
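Putting the regex and the prompted variable together, the format task might look like:

```yaml
- name: Format the name-node, answering the re-format prompt automatically
  expect:
    command: hadoop namenode -format
    responses:
      "Re-format.*": "{{ response }}"   # Y formats, N keeps the existing data
```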

After this we’ll proceed with stopping the firewall completely (for now), as a Hadoop cluster uses multiple ports. And then, we’ll start our name-node service.
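Something along these lines, using the service module for firewalld and Hadoop 1.x’s daemon script:

```yaml
- name: Stop the firewall so Hadoop's ports are reachable
  service:
    name: firewalld
    state: stopped

- name: Start the name-node daemon
  command: hadoop-daemon.sh start namenode
```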

Data-Node

The core-site.xml and hdfs-site.xml files we’re going to use here will look something like this:

core-site.xml
hdfs-site.xml

As you can see, the core-site.xml file is the same one we used on the name-node. The important thing required here is the name-node’s IP. So the file we fetched earlier in the Name-Node section now comes into use, as it has the name-node’s IP written in it.
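So the data-node play can simply push the fetched copy back out, for example (the fetched path is the one assumed in the name-node sketch):

```yaml
- hosts: datanode
  tasks:
    - name: Reuse the core-site.xml fetched from the name-node
      copy:
        src: fetched/core-site.xml
        dest: /var/hadoop/core-site.xml
```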

One more tricky part comes here: for the hdfs-site.xml file we need to use a different folder name on every node that we add. To do that we’ll use Ansible’s special variables.

The inventory_hostname variable is the name of the node as it appears in the inventory on the Ansible control node. So every node we add to the datanode group of the inventory will use its own name (node2/3/4/5) in this place, which creates a directory of the same name in /home/devops/.
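With that variable, a templated hdfs-site.xml for the data-nodes might read:

```xml
<!-- hdfs-site.xml on each data-node: a per-node storage directory -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/devops/{{ inventory_hostname }}</value>
  </property>
</configuration>
```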

After this we’ll just stop the firewall and start the data-node.
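These closing tasks mirror the name-node’s:

```yaml
- name: Stop the firewall on the data-node
  service:
    name: firewalld
    state: stopped

- name: Start the data-node daemon
  command: hadoop-daemon.sh start datanode
```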

Client

Setting up the client is the simple part, as it just needs one file to be configured and doesn’t need any service to be started.

It needs the same core-site.xml file that is used on the name-node and data-nodes, so we’ll just copy it from the location it was fetched to.
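So the entire client play can be as small as (again assuming the fetched path from the name-node sketch):

```yaml
- hosts: client
  tasks:
    - name: Give the client the same core-site.xml fetched from the name-node
      copy:
        src: fetched/core-site.xml
        dest: /var/hadoop/core-site.xml
```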

Checking our Cluster

Name-node:

As you can see, the hadoop dfsadmin -report command shows one node connected, providing 16.99 GB of storage, out of which 14.99 GB remains.

Uploading and reading a file from client:
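On the client this amounts to something like the following (the file name is just an example):

```shell
hadoop fs -put notes.txt /    # upload the file to the cluster root
hadoop fs -cat /notes.txt     # read it back through HDFS
```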

File uploaded and read successfully!
