CHAPTER 3. WORKING OF HADOOP
For distributed storage and distributed computation, Hadoop uses a master/slave
architecture. The distributed storage system in Hadoop is called the Hadoop Distributed
File System (HDFS). A client interacts with HDFS by communicating with the NameNode and
DataNodes. The user does not know which NameNode and DataNodes are assigned, or will be
assigned, to serve a request. HDFS follows the master-slave architecture and has the
following elements.
1. NAME NODE
The name node is commodity hardware that contains the GNU/Linux operating system
and the name node software, which can run on ordinary commodity machines. The
system hosting the name node acts as the master server and performs the following tasks:
it manages the file system namespace, regulates clients' access to files, and executes
file system operations such as renaming, closing, and opening files and directories [1].
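The name node's namespace bookkeeping can be pictured as a table mapping file paths to block lists. The following is a minimal conceptual sketch, not Hadoop's actual API; the class and method names are hypothetical and chosen only to mirror the operations listed above.

```python
# Conceptual model of a name node's namespace (illustrative only;
# real HDFS keeps this metadata in the NameNode's memory and edit log).
class NameNodeSketch:
    def __init__(self):
        self.namespace = {}  # file path -> list of block IDs

    def create(self, path):
        # Open/create a file entry in the namespace.
        self.namespace[path] = []

    def rename(self, old, new):
        # Rename is a pure metadata operation: no data blocks move.
        self.namespace[new] = self.namespace.pop(old)

    def delete(self, path):
        del self.namespace[path]

nn = NameNodeSketch()
nn.create("/logs/app.log")
nn.rename("/logs/app.log", "/logs/app.old")
```

The key point the sketch illustrates is that renaming or deleting a file touches only this metadata table; the file's data blocks on the data nodes are unaffected.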
2. DATA NODE
The data node is commodity hardware running the GNU/Linux operating system and the
data node software. For every node (commodity hardware/system) in a cluster, there is
a data node. These nodes manage the data storage of their system. Data nodes perform
read-write operations on the file system as per client requests. They also perform
operations such as block creation, deletion, and replication according to the
instructions of the name node [1].
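The division of labor described above can be sketched in a few lines: data nodes hold raw block data and copy blocks between themselves when told to. This is a toy model with hypothetical names, not the HDFS DataNode protocol.

```python
# Conceptual model of data nodes storing and replicating blocks
# (illustrative only; real DataNodes stream blocks over the network).
class DataNodeSketch:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> raw bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]

    def replicate_to(self, block_id, other):
        # In HDFS this step happens on the name node's instruction.
        other.write_block(block_id, self.blocks[block_id])

dn1, dn2 = DataNodeSketch("dn1"), DataNodeSketch("dn2")
dn1.write_block("blk_1", b"payload")
dn1.replicate_to("blk_1", dn2)  # second replica now lives on dn2
```

Note that the name node never appears in the data path: clients and data nodes exchange the bytes, while the name node only directs where replicas should go.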
3. BLOCK
Generally, user data is stored in the files of HDFS. A file in the file system is divided
into one or more segments, which are stored on individual data nodes. These file segments
are called blocks. In other words, a block is the minimum amount of data that HDFS can
read or write. The default block size is 64 MB, but it can be increased as needed by
changing the HDFS configuration [1].
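The block arithmetic follows directly from the definition: a file needs as many blocks as it takes to cover its size, and only the last block may be partially filled. A small worked example, assuming the default 64 MB block size:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size: 64 MB

def num_blocks(file_size_bytes):
    """Number of HDFS blocks needed to store a file of the given size."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 200 MB file occupies 4 blocks:
# three full 64 MB blocks plus one final 8 MB block.
blocks_for_200mb = num_blocks(200 * 1024 * 1024)
```

Unlike a disk file system, HDFS does not pad the last block to the full block size, so the 8 MB tail above consumes only 8 MB of storage (before replication).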
3.2 Hadoop Operation Modes
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one
of the three supported modes:
• Local/Standalone Mode: After downloading Hadoop to your system, it is by default
configured in standalone mode and runs as a single Java process.
• Pseudo-Distributed Mode: This is a distributed simulation on a single machine. Each
Hadoop daemon, such as HDFS, YARN, and MapReduce, runs as a separate Java process.
This mode is useful for development.
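Switching from standalone to pseudo-distributed mode is done through Hadoop's XML configuration files. A typical minimal setup, as a sketch (the property names `fs.defaultFS` and `dfs.replication` are standard Hadoop configuration keys, but the port and file locations should be checked against the documentation for your Hadoop version):

```xml
<!-- core-site.xml: point the default file system at HDFS on localhost -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single-node cluster can hold only one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Setting the replication factor to 1 is what makes pseudo-distributed mode viable: with only one data node, the default replication factor of 3 could never be satisfied.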
SNJB’s Late Sau. K. B. Jain College of Engineering, Chandwad, Dist. Nashik. 17