How to contribute a limited/specific amount of storage as a slave (Data Node) to the Hadoop HDFS Cluster?


The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

By default, a Hadoop Slave Node/Data Node contributes all the storage present in the “/” (root) drive to the Hadoop Cluster. For example, if our root drive has a capacity of 100 GB, then on starting the Data Node we will be contributing the whole 100 GB of storage to the Cluster.

Out of the box, Hadoop does not give us a direct way to contribute only a limited/specific amount of storage as a slave/Data Node to the Hadoop HDFS Cluster.

But we can achieve this requirement/use case by combining the concepts of Linux partitions and Hadoop.

Step 1: Attach a Disk/Drive to the Data Node.

For testing purposes, I am running a Hadoop HDFS Cluster consisting of:

  1. One Name Node
  2. One Data Node

This Hadoop HDFS Cluster is running on AWS Cloud using 2 EC2 Instances.

In AWS, if we want to attach a volume to an EC2 Instance, we use the Amazon EBS (Elastic Block Store) service. An Amazon EBS volume is a durable, block-level storage device that you can attach to your instances. After you attach a volume to an instance, you can use it as you would use a physical hard drive. EBS volumes are flexible and elastic.

To create an EBS Volume, go to:

Volumes > Create Volume


To attach the EBS Volume, the EC2 Instance and EBS Volume should be in the same Availability Zone.
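The screenshots above walk through the console flow; roughly equivalent AWS CLI calls are sketched below. The size, Availability Zone, volume ID, and instance ID are placeholders, not values from the original setup.

    # Create a volume in the same Availability Zone as the datanode1 instance
    aws ec2 create-volume \
        --size 20 \
        --volume-type gp2 \
        --availability-zone ap-south-1a

    # Attach it to the instance; inside the VM it will show up as /dev/xvdf
    aws ec2 attach-volume \
        --volume-id vol-0123456789abcdef0 \
        --instance-id i-0123456789abcdef0 \
        --device /dev/xvdf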


Now your EBS Volume is successfully attached to the datanode1 EC2 Instance. You can verify it using the AWS Web Console.


Connect to the datanode1 Instance using SSH or EC2 Instance Connect.

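Once you are on the instance, something along these lines confirms the new disk is visible (the device name /dev/xvdf is the one used later in this walkthrough; on Nitro-based instance types it may instead appear as /dev/nvme1n1). The commands in this and the following steps assume a root shell; otherwise prefix them with sudo.

    # List the block devices; the freshly attached, still-unpartitioned volume
    # should be visible here as xvdf
    lsblk

    # Show the size and (empty) partition table of the new disk
    fdisk -l /dev/xvdf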

I haven’t started my Data Node yet. So if you run the report command on the Name Node:

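The screenshot showed the HDFS admin report; assuming the Hadoop binaries are on the PATH, the command looks like this (on Hadoop 2.x and later it is also available as hdfs dfsadmin -report):

    # Print the cluster-wide report: Configured Capacity, live Data Nodes, etc.
    hadoop dfsadmin -report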

You can clearly see that there are no Data Nodes yet and the Configured Capacity of the HDFS Cluster is 0.

Step 2: Create a static partition on the EBS Volume attached to the Data Node earlier.

I am going to create a 10 GB partition on the /dev/xvdf disk. It is the same volume that was attached to this Data Node earlier.

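The screenshots stepped through an interactive fdisk session; a sketch of the same flow is below, with the keystrokes shown as comments. The +10G size matches the 10 GB partition described above.

    # Start an interactive fdisk session on the attached disk
    fdisk /dev/xvdf
    #   n        -> create a new partition
    #   p        -> make it a primary partition
    #   1        -> partition number 1
    #   <Enter>  -> accept the default first sector
    #   +10G     -> make the partition 10 GiB
    #   w        -> write the partition table and exit

    # If /dev/xvdf1 does not show up right away, ask the kernel to re-read the table
    partprobe /dev/xvdf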

Now our 10 GB partition is created.

Step 3: Format this Partition (/dev/xvdf1).

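The screenshot showed the partition being formatted. The ext4 filesystem below is an assumption (the original may have used ext3 or xfs); any filesystem Linux can mount will work.

    # Put a filesystem on the new 10 GB partition
    mkfs.ext4 /dev/xvdf1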

Step 4: Mount this Partition (/dev/xvdf1) to a directory.

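A minimal sketch, assuming we mount the partition at a directory named /dn1 (the directory name is a placeholder; any empty directory will do):

    # Create a mount point and mount the formatted partition on it
    mkdir /dn1
    mount /dev/xvdf1 /dn1

    # Confirm the ~10 GB filesystem is mounted where we expect
    df -h /dn1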

Step 5: Update the /etc/hadoop/hdfs-site.xml file.

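The screenshot showed the edited configuration. A minimal hdfs-site.xml on the Data Node, assuming the /dn1 mount point from the previous step, looks roughly like this; note that the property is named dfs.data.dir on Hadoop 1.x and dfs.datanode.data.dir on Hadoop 2.x and later.

    <configuration>
        <!-- Point the Data Node's block storage at the mounted 10 GB partition -->
        <property>
            <name>dfs.data.dir</name>
            <value>/dn1</value>
        </property>
    </configuration>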

Step 6: Start the Data Node Service.

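Assuming a classic Hadoop installation where the daemon scripts are on the PATH, starting the Data Node looks something like this (on Hadoop 3.x the equivalent is hdfs --daemon start datanode):

    # Start the Data Node daemon; it registers itself with the Name Node
    hadoop-daemon.sh start datanode

    # Confirm the DataNode process is running
    jps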

Now if you check the Name Node, you will see that only 10 GB of storage is contributed by the Data Node to the Hadoop HDFS Cluster.

You can verify this by running the report command on the Name Node again.

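As before, the report on the Name Node now shows one live Data Node, and the Configured Capacity is roughly the 10 GB partition (minus filesystem overhead) rather than the whole root drive:

    # Configured Capacity should now reflect only the mounted 10 GB partition
    hadoop dfsadmin -report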

That’s all for today! I’ll be back with some new articles very soon, thanks!

Muhammad Tabish Khanday

Let’s code 💻.
