How to contribute a limited/specific amount of storage as a slave (Data Node) to the Hadoop HDFS Cluster?
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
By default, Hadoop Slave Nodes/Data Nodes contribute all the storage present on the "/" (root) drive to the Hadoop Cluster. E.g., if our root drive has a capacity of 100 GB, then on starting this data node we will be contributing the whole 100 GB of storage to the Cluster.
Hadoop does not directly support contributing only a limited/specific amount of storage as a slave/data node to the Hadoop HDFS Cluster.
But we can achieve this requirement/use case by integrating the concepts of Linux partitions and Hadoop.
Step 1: Attach a Disk/Drive to the Data Node.
For testing purposes, I am running a Hadoop HDFS Cluster comprising:
- One Name Node
- One Data Node
This Hadoop HDFS Cluster is running on AWS Cloud using 2 EC2 Instances.
In AWS, if we want to attach a volume to an EC2 Instance, we use the Amazon EBS (Elastic Block Store) service. An Amazon EBS volume is a durable, block-level storage device that you can attach to your instances. After you attach a volume to an instance, you can use it as you would use a physical hard drive. EBS volumes are flexible and elastic.
To create an EBS Volume go to:
Volumes > Create Volume
To attach the EBS Volume, the EC2 Instance and the EBS Volume must be in the same Availability Zone.
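For reference, the same volume can also be created and attached with the AWS CLI. The Availability Zone, size, and IDs below are placeholder assumptions, not values from this setup:

```shell
# Create a 10 GiB volume in the same Availability Zone as datanode1
# (AZ, size, and volume type are placeholder assumptions)
aws ec2 create-volume \
    --availability-zone ap-south-1a \
    --size 10 \
    --volume-type gp2

# Attach the volume to the datanode1 instance as /dev/xvdf
# (volume and instance IDs are placeholders)
aws ec2 attach-volume \
    --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 \
    --device /dev/xvdf
```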
Now your EBS Volume is successfully attached to the datanode1 EC2 Instance. You can verify it using the AWS Web Console.
Connect to the datanode1 Instance using SSH or EC2 Instance Connect.
I haven’t started my data node yet. So if you run this command on the Name Node:
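The cluster-report command is:

```shell
# Prints the Configured Capacity and the list of live data nodes
hdfs dfsadmin -report
```

On older Hadoop 1.x releases, the equivalent is `hadoop dfsadmin -report`.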
You can clearly see that there are no Data Nodes yet and that the Configured Capacity of the HDFS Cluster is 0.
Step 2: Create a Static partition in the EBS Volume attached to the Data Node earlier.
I am going to create a 10 GB partition on the /dev/xvdf disk. It is the same volume that was attached to this data node earlier.
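A minimal sketch of the partitioning step, assuming the disk shows up as /dev/xvdf (run as root):

```shell
# Confirm the new disk is visible and has no partitions yet
lsblk

# Create the 10 GB partition interactively with fdisk:
#   n      -> new partition
#   p      -> primary
#   1      -> partition number
#   Enter  -> accept the default first sector
#   +10G   -> make the partition 10 GB
#   w      -> write the partition table and exit
fdisk /dev/xvdf
```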
Now our Partition of 10GB is created.
Step 3: Format this Partition (/dev/xvdf1)
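Assuming ext4 (any Linux filesystem works here), the format step is:

```shell
# Format the new 10 GB partition with an ext4 filesystem (run as root)
mkfs.ext4 /dev/xvdf1
```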
Step 4: Mount this Partition (/dev/xvdf1) to a directory.
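The mount-point name /dn1 below is an assumption for illustration; any empty directory will do:

```shell
# Create a mount point and mount the 10 GB partition on it (run as root)
mkdir -p /dn1
mount /dev/xvdf1 /dn1

# Verify: /dn1 should show roughly 10 GB of total space
df -h /dn1
```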
Step 5: Update /etc/hadoop/hdfs-site.xml File.
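A sketch of the data-node configuration, assuming the /dn1 mount point from the previous step. On Hadoop 1.x (which matches the /etc/hadoop path used here) the property name is dfs.data.dir; on Hadoop 2.x and later it is dfs.datanode.data.dir:

```xml
<!-- /etc/hadoop/hdfs-site.xml on the data node -->
<configuration>
    <property>
        <!-- Point the data node's storage at the mounted 10 GB partition -->
        <name>dfs.data.dir</name>
        <value>/dn1</value>
    </property>
</configuration>
```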
Step 6: Start Data Node Service.
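Assuming a Hadoop 1.x-style install (matching the /etc/hadoop path above); on Hadoop 3.x the equivalent command is `hdfs --daemon start datanode`:

```shell
# Start the data node daemon on datanode1
hadoop-daemon.sh start datanode

# Confirm the DataNode process is running
jps
```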
Now if you check on the name node, you will see that only 10 GB of storage is contributed by the data node to the Hadoop HDFS Cluster.
You can verify this by running this command on the Name Node:
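```shell
# On the Name Node: Configured Capacity should now be roughly 10 GB,
# and one live data node should be listed
hdfs dfsadmin -report
```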
That’s all for today! I’ll be back with some new articles very soon, thanks!
Muhammad Tabish Khanday