The Hadoop Distributed Filesystem

An Introduction to Storing Large Files in Hadoop.

Ojas Gupta
DataX Journal

--

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage storage across a network of machines are called distributed filesystems. Since they are network-based, all the complications of network programming kick in, making distributed filesystems more complex than regular disk filesystems. For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.

THE DESIGN OF HDFS

Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let’s examine this statement in more detail.

Very Large Files:

“Very Large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Streaming Data Access:

HDFS is built around the idea that the most efficient data processing pattern is write once, read many times. A dataset is typically generated or copied from a source, and various analyses are then performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

Commodity Hardware:

HDFS does not require expensive, highly reliable hardware; it is designed to run on clusters of commonly available commodity machines, which makes it very cost-efficient. It can even be deployed on our local machines for development and testing.

Namenodes and Datanodes

An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located.

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.
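
On a running cluster, this master-worker split is easy to observe from the command line. As a rough sketch (the file path used here is only an example, and both commands assume a running HDFS and suitable permissions), you can report the cluster capacity along with the list of live datanodes, and ask which datanodes store the blocks of a particular file:

hdfs dfsadmin -report

hdfs fsck /user/demo/sample.txt -files -blocks -locations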

Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files in the system would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this.

The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to the local disk as well as a remote NFS mount.
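
As a minimal sketch of how this is usually configured (assuming Hadoop 2 or later; older releases used dfs.name.dir instead), the dfs.namenode.name.dir property in hdfs-site.xml takes a comma-separated list of directories, and the namenode writes its metadata to all of them. The directory paths below are purely illustrative:

<property>
  <name>dfs.namenode.name.dir</name>
  <!-- one local directory plus a remote NFS mount; both receive every write -->
  <value>/data/dfs/name,/mnt/nfs/dfs/name</value>
</property>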

It is also possible to run a secondary namenode, which, despite its name, does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine because it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary.

The Command-Line Interface

There are two properties that we set in the pseudo-distributed configuration that deserve further explanation. The first is fs.defaultFS, set to hdfs://localhost/, which is used to set a default filesystem for Hadoop. Filesystems are specified by a URI, and here an hdfs URI is used to configure Hadoop to use HDFS by default. The HDFS daemons will use this property to determine the host and port for the HDFS namenode, and HDFS clients will use it to work out where the namenode is running so they can connect to it.

We can set the second property, dfs.replication, to 1 so that HDFS doesn’t replicate filesystem blocks by the default factor of three. When running with a single datanode, HDFS can’t replicate blocks to three datanodes, so it would perpetually warn about blocks being under-replicated. This setting solves that problem.
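
For reference, in a pseudo-distributed setup these two properties typically live in core-site.xml and hdfs-site.xml in Hadoop’s configuration directory (the exact location depends on your installation), roughly like this:

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>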

Basic Filesystem Operations

We can do all of the usual filesystem operations, such as reading files, creating directories, moving files, deleting data, and listing directories. You can type hadoop fs -help to get detailed help on every command.

Some of the commands used in HDFS are:

Copying a file from the local filesystem to HDFS

hadoop fs -copyFromLocal /filename /HDFSfilelocation

Moving a file from the local filesystem to HDFS

hadoop fs -moveFromLocal /filename /HDFSfilelocation

Creating a directory in HDFS

hadoop fs -mkdir /directoryname

Changing the file permissions

hadoop fs -chmod 777 /filename

Here, hadoop fs refers to the Hadoop filesystem shell, which these commands use to interact with HDFS.

If you want to check the files available in HDFS (your HDFS home directory, when no path is given), simply type

hadoop fs -ls
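
Putting a few of these commands together, a short session might look like the following sketch; the directory and file names are only examples:

hadoop fs -mkdir /user/demo
hadoop fs -copyFromLocal sample.txt /user/demo/
hadoop fs -ls /user/demo
hadoop fs -cat /user/demo/sample.txt
hadoop fs -copyToLocal /user/demo/sample.txt sample-copy.txt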

This is how you can use the distributed filesystem of Hadoop. Today, a large share of the world’s large-scale data is stored in filesystems built on the ideas behind Hadoop.
