BIG DATA ANALYTICS USING HADOOP
A
Seminar Report submitted to Savitribai Phule Pune University, Ganeshkhind
in partial fulfillment for the award of the Degree of Engineering in Computer Engineering
Submitted by
Sarita Bagul T120414208
Under the Guidance of
Asst. Prof. B. A. Khivsara
March, 2016-17
Department of Computer Engineering
SNJB's Late Sau. Kantabai Bhavarlalji Jain
College of Engineering, Chandwad
Dist: Nashik
SNJB's Late Sau. Kantabai Bhavarlalji Jain
College of Engineering, Chandwad
Dist: Nashik
Department of Computer Engineering
2016-17
Certificate
This is to certify that the Seminar Report entitled Big Data Analytics using Hadoop submitted by Ms. Sarita Bagul is a record of bonafide work carried out at SNJB's K. B. Jain College of Engineering, which is affiliated to Savitribai Phule Pune University, during the academic year 2016-17.
Date:
Place: Chandwad
Seminar Guide: Asst. Prof. B. A. Khivsara
HOD: Prof. K. M. Sanghavi
Principal: Dr. M. D. Kokate
Department of Computer Engineering
Examiner: .......................................................................................................
Acknowledgement
I would like to acknowledge all the people who have helped and assisted me throughout my seminar work. First of all, I would like to thank my respected guide, Asst. Prof. B. A. Khivsara, Assistant Professor in the Department of Computer Engineering, for introducing me to the concepts needed for this work. The time-to-time guidance, encouragement, and valuable suggestions received from her are unforgettable in my life. This work would not have been possible without her enthusiastic response, insight, and new ideas.
I am also grateful to all the faculty members of SNJB's College of Engineering for their support and cooperation.
I would like to thank my lovely parents for their time-to-time support, encouragement, and valuable suggestions, and to thank my friends for their valuable support and encouragement.
This acknowledgement would be incomplete without mention of the blessing of the Almighty, which helped me keep high morale during the most difficult periods.
Sarita Bagul
Abstract
Big Data analytics is the process of analyzing and mining Big Data. Big data is a collection of large data sets that include different types of data: structured, unstructured, and semi-structured. Nowadays, the challenge is extracting meaningful information from and analyzing Big Data. In this report we first introduce the general background of big data. Big data exceeds the processing capability of traditional databases to capture, manage, and process the voluminous amount of data.
Big data, which refers to data sets that are too big to be handled using existing database management tools, is emerging in many important applications, such as Internet search, business informatics, social networks, social media, genomics, and meteorology. Big data presents a grand challenge for database and data analytics research, and this report surveys work addressing that challenge.
This report reviews how Big Data is analyzed using Hadoop, and then focuses on the Hadoop platform and its MapReduce algorithm, which provides the environment to implement applications in a distributed environment. Since Hadoop has emerged as a popular tool for Big Data implementation, this report deals with the overall architecture of Hadoop along with the details of its various components. Hadoop is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
Keywords: Big Data, MapReduce, HDFS, Hadoop.
Contents

Acknowledgement
Abstract
1 Introduction
1.1 Introduction
1.1.1 Characteristics Of Big Data
1.2 Big Data Analytics
1.3 Hadoop
1.4 Features Of Hadoop
2 Literature Survey
2.1 MapReduce
2.1.1 MapReduce Architecture
2.1.2 Advantages Of MapReduce
2.1.3 Disadvantages Of MapReduce
2.1.4 Applications Of MapReduce
2.2 NoSQL
2.2.1 Features Of NoSQL
2.2.2 Characteristics Of NoSQL
2.2.3 NoSQL Categories
2.2.4 Advantages Of NoSQL
2.2.5 Disadvantages Of NoSQL
2.3 Hadoop As An Open Source Tool For Big Data Analytics
2.3.1 Hadoop Technology Stack
2.4 Components Of Hadoop
3 Working Of Hadoop
3.1 Architecture of Hadoop
3.1.1 Hadoop Distributed File System (HDFS)
3.2 Hadoop Operation Modes
3.3 Installation And Setup Of Hadoop
3.3.1 Installing Hadoop in Standalone Mode
3.3.2 Installing Hadoop in Pseudo Distributed Mode
3.3.3 Verifying Hadoop Installation
3.4 Some Essential Hadoop Projects
4 Applications Of Big Data Analytics Using Hadoop
4.1 Applications Of Big Data Analytics Using Hadoop
4.2 Health Care Applications
4.2.1 Hadoop Technology In Monitoring Patient Vitals
5 Advantages/Disadvantages of Hadoop
5.1 Advantages Of Hadoop
5.2 Disadvantages Of Hadoop
6 Conclusion
6.1 Conclusion
7 Bibliography
List of Figures

1.1 5 Vs Of Big Data
1.2 Value Of Big Data Analytics
2.1 MapReduce Architecture
2.2 NoSQL Architecture
2.3 Key-Value Store
2.4 Graph Database
2.5 Hadoop Technology Stack
3.1 Hadoop Architecture
3.2 HDFS Architecture
3.3 core-site.xml
3.4 hdfs-site.xml
3.5 hdfs-site.xml properties
3.6 yarn-site.xml
3.7 mapred-site.xml
3.8 Hadoop on Browser
3.9 Hadoop application on Cluster
4.1 Hadoop Technology In Monitoring Patient Vitals
Chapter 1
Introduction
1.1 Introduction
Big Data is a massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional database and software techniques. Users are producing more and more data through communication media in an unstructured form, which is highly unmanageable, and the management of this data is a challenging job. The main focus is to gather the unstructured data and process it into a structured form so that accessing the data becomes easier. Accordingly, data is analyzed and processed to convert it into meaningful and correct information by using the concept of Big Data Analytics.
An example of big data may be exabytes of data consisting of trillions of records of millions of people from different sources such as websites, social media, mobile data, web servers, online transactions and so on [1].
1.1.1 Characteristics Of Big Data
As the data is too big and comes from various sources in different forms, it is characterized by the following five components [4]:
Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
Variety: The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.
Velocity: In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
Value: The worth of the information that can be extracted; data is only valuable when it can be turned into useful insight.
Figure 1.1: 5 Vs Of Big Data
Veracity: The quality of captured data can vary greatly, affecting the accuracy of analysis.
1.2 Big Data Analytics
Big Data analytics is the process of analyzing and mining Big Data: analyzing big data to find hidden patterns, unknown correlations, and other useful information that can be extracted to make better decisions. Big data analytics is where advanced analytic techniques operate on big data sets. Hence, big data analytics is really about two things, big data and analytics, plus how the two have teamed up to create one of the most profound trends in business intelligence (BI) today [11]. Big data analytics is helpful in managing more, and more diverse, data.
Figure 1.2 demonstrates the value of Big Data Analytics by plotting cumulative cash flow against time. With older analytics techniques, such as a conventional data warehousing application, you have to wait hours or days to get information, as compared to Big Data Analytics. Information has timeliness value only when it is processed at the right time; otherwise it is of little use and may not return its value at a reasonable cost.
Figure 1.2: Value Of Big Data Analytics
1.3 Hadoop
Hadoop is an open source, Java-based programming framework that supports
the processing and storage of extremely large data sets in a distributed computing envi-
ronment[7]. Hadoop is used to develop applications that could perform complete statistical
analysis on huge amounts of data.
Hadoop is by far the most popular implementation of MapReduce, being an entirely open source platform for handling Big Data. Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, the Hadoop framework is capable of developing applications that run on clusters of computers and perform complete statistical analysis on huge amounts of data.
Hadoop has two major layers, namely:
1. Processing/Computation layer (MapReduce), and
2. Storage layer (Hadoop Distributed File System) [1].
1.4 Features Of Hadoop
Scalability: Nodes can be easily added and removed, and failed nodes can be easily detected.
Low cost: As Hadoop is an open source framework, it is free. It uses commodity hardware to store and process huge data, and hence is not very costly.
High computing power: Hadoop uses a distributed computing model, so tasks can be distributed amongst different nodes and processed quickly.
Huge and flexible storage: Massive data storage is available due to the thousands of nodes in the cluster. It supports both structured and unstructured data, and no preprocessing is required on the data before storing it.
Fault tolerance and data protection: If any node fails, the tasks in hand are automatically redirected to other nodes, and multiple copies of all data are stored automatically.
Economical: It distributes the data and processing across clusters of commonly available computers (in thousands).
Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures [18].
Chapter 2
Literature Survey
This survey illustrates that in earlier days the data was small and easily handled by RDBMS tools, but recently it has become difficult to handle the huge data, which is referred to as Big Data [2]. Big data exhibits different characteristics like volume, variety, variability, value, velocity, and complexity, due to which it is very difficult to analyse the data and obtain information with traditional data mining techniques [1].
The tools used for mining big data are Apache Hadoop, Apache Pig, Cascading, Scribe, Storm, Apache HBase, etc. Our ability to handle many exabytes of data depends mainly on the existence of a rich variety of datasets, techniques, and software frameworks. Many older technologies are already in use for big data handling, but each of them has some advantages and disadvantages. A number of such technologies exist; a few of them are mentioned below:
Column-oriented databases
NoSQL databases
MapReduce
Hive
Pig
WibiData
PLATFORA
Apache Zeppelin
Hadoop
2.1 MapReduce
MapReduce provides analytical capabilities for analyzing huge volumes of complex
data. The Map Reduce program is an Apache open-source framework which runs on Hadoop.
MapReduce is a software framework. Data processing is done by MapReduce.MapReduce
scales and runs an application to different cluster machines.
2.1.1 MapReduce Architecture
MapReduce works by breaking the processing into two phases: map and reduce. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
Figure 2.1: MapReduce Architecture
Mapping and Reducing are two important phases in executing an application program. In the Mapping phase, MapReduce takes the input data and filters and transforms each data element through the mapper. In the Reducing phase, the reducer processes all the outputs from the mapper, aggregates them, and then provides a final result. Decomposing a data processing application into mappers and reducers is sometimes nontrivial, but this simple scalability is what has attracted many programmers to the MapReduce model [1].
map: the function that takes key/value pairs as input and generates an intermediate set of key/value pairs.
reduce: the function that merges all the intermediate values associated with the same intermediate key.
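To make these two functions concrete, the following is a minimal sketch of the canonical word-count job written against the Hadoop 2.x MapReduce Java API. It is an illustration added for this report, not code from the surveyed papers; the class name and input/output paths are arbitrary.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: emits an intermediate (word, 1) pair for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // reduce: merges all intermediate values for the same intermediate key by summing them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job is typically packaged as a jar and submitted with a command like "hadoop jar wordcount.jar WordCount /input /output".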
2.1.2 Advantages Of MapReduce
Scalability
Cost-effective solution
Flexibility
Fast
Security and Authentication
Parallel processing
Availability and resilient nature
2.1.3 Disadvantages Of MapReduce
Batch processing, not interactive.
Designed for a specific problem domain.
The MapReduce programming paradigm is not commonly understood.
The index generated in the Map step is one-dimensional, and the Reduce step must not generate a large amount of data or there will be serious performance degradation.
MapReduce may not be a good fit for full-text indexing or ad hoc searching.
Lack of trained support professionals.
2.1.4 Applications Of MapReduce
Distributed Grep
Large-Scale PDF Generation
Artificial Intelligence
Geographical data
2.2 NoSQL
A NoSQL (originally referring to "non SQL" or "non relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases (RDBMS) [13]. It encompasses a wide variety of different database technologies that were developed in response to a rise in the volume of data stored about users, objects and products, the frequency with which this data is accessed, and performance and processing needs. Generally, NoSQL databases are structured in a key-value pair, graph database, document-oriented, or column-oriented structure.
2.2.1 Features Of NoSQL
NoSQL removes the complexity of SQL queries, the burden of up-front schema design, and the need for a dedicated DBA; transactions are handled at the application layer. Its features include [14]:
Easy and frequent changes to the DB
Fast development
Large data volumes (e.g., Google)
Schema-less design
Open source development
2.2.2 Characteristics Of NoSQL
Almost infinite horizontal scaling
Simpler and faster
No fixed table schemas
No join operations
Structured storage
Almost everything happens in RAM [14]
Figure 2.2: NoSQL Architecture
2.2.3 NoSQL Categories
There are four general types (most common categories) of NoSQL databases [15]. Each of these categories has its own specific attributes and limitations. There is no single solution that is better than all the others; however, some databases are better suited to solving specific problems. To clarify the NoSQL databases, let us discuss the most common categories:
Key-value stores
Column-oriented
Graph
Document oriented
Key-Value Store
Key-value stores are the most basic type of NoSQL database.
Designed to handle huge amounts of data.
Based on Amazon's Dynamo paper.
Key-value stores allow developers to store schema-less data.
In key-value storage, the database stores data as a hash table where each key is unique and the value can be a string, JSON, a BLOB (Binary Large Object), etc.
For example, a key-value pair might consist of a key like "Name" that is associated with a value like "Robin".
Keys may be strings, hashes, lists, sets, or sorted sets, and values are stored against these keys.
Key-value stores can be used as collections, dictionaries, associative arrays, etc.
Key-value stores follow the 'Availability' and 'Partition Tolerance' aspects of the CAP theorem.
Key-value stores work well for shopping cart contents, or individual values like color schemes, a landing page URI, or a default account number.
Figure 2.3: Key-Value Store
Examples of key-value store databases: Redis, Dynamo, Riak, etc.
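As a minimal in-memory illustration of the model (a sketch added here for exposition, not a real NoSQL client), a key-value store behaves like a map from unique keys to opaque values:

import java.util.HashMap;
import java.util.Map;

public class KeyValueDemo {
  public static void main(String[] args) {
    // The store maps a unique key to an opaque value; lookup is by key only.
    Map<String, String> store = new HashMap<>();
    store.put("Name", "Robin");               // the example pair from above
    store.put("cart:1001", "{\"items\": 3}"); // values may be JSON or BLOBs
    System.out.println(store.get("Name"));    // prints: Robin
  }
}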
Column-Oriented Databases
Column-oriented databases primarily work on columns and every column is treated
individually.
Values of a single column are stored contiguously.
Column stores keep data in column-specific files.
In column stores, query processors also work on columns.
All data within each column datafile has the same type, which makes it ideal for compression.
Column stores can improve the performance of queries, as they can access specific column data directly.
They give high performance on aggregation queries (e.g., COUNT, SUM, AVG, MIN, MAX).
They work well for data warehouses and business intelligence, customer relationship management (CRM), library card catalogs, etc.
Examples of column-oriented databases: BigTable, Cassandra, SimpleDB, etc.
Graph Databases
A graph data structure consists of a finite (and possibly mutable) set of ordered
pairs, called edges or arcs, of certain entities called nodes or vertices. The following picture
presents a labeled graph of 6 vertices and 7 edges.
Figure 2.4: Graph Database
A graph database stores data in a graph.
It is capable of elegantly representing any kind of data in a highly accessible way.
A graph database is a collection of nodes and edges.
Each node represents an entity (such as a student or business) and each edge represents
a connection or relationship between two nodes.
Every node and edge is defined by a unique identifier.
Each node knows its adjacent nodes.
As the number of nodes increases, the cost of a local step (or hop) remains the same.
Examples of graph databases: OrientDB, Neo4J, Titan, etc.
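The following small adjacency-list sketch (an illustration added here, not taken from any particular graph database) shows the underlying idea: each node knows its adjacent nodes directly, so a local hop costs the same regardless of graph size.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraphDemo {
  // node identifier -> identifiers of adjacent nodes
  private final Map<String, List<String>> adjacency = new HashMap<>();

  // an undirected edge represents a relationship between two entities
  public void addEdge(String from, String to) {
    adjacency.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    adjacency.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
  }

  public List<String> neighbours(String node) {
    return adjacency.getOrDefault(node, Collections.emptyList());
  }

  public static void main(String[] args) {
    GraphDemo g = new GraphDemo();
    g.addEdge("student:alice", "business:acme"); // node -> relationship -> node
    System.out.println(g.neighbours("student:alice"));
  }
}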
Document Oriented Databases
A collection of documents.
Data in this model is stored inside documents.
A document is a key value collection where the key allows access to its value.
Documents are not typically forced to have a schema and therefore are flexible and
easy to change.
Documents are stored into collections in order to group different kinds of data.
Documents can contain many different key-value pairs, or key-array pairs, or even
nested documents.
Examples of document-oriented databases: MongoDB, CouchDB, etc.
2.2.4 Advantages Of NoSQL
High scalability[16]
Distributed Computing[16]
Lower cost[16]
Schema flexibility, semi-structure data[16]
No complicated Relationships[16]
It supports simple query languages.
It provides fast performance.
It provides horizontal scalability.
2.2.5 Disadvantages Of NoSQL
No standardization
Limited query capabilities (so far)
Eventual consistency is not intuitive to program [16].
2.3 Hadoop As An Open Source Tool For Big Data
Analytics
Hadoop is a distributed software solution: a scalable, fault-tolerant distributed system for data storage and processing [3]. There are two main components in Hadoop:
Figure 2.5: Hadoop Technology Stack
1) HDFS (storage), and
2) MapReduce (retrieval and processing).
Hadoop Core/Common consists of the libraries and utilities that provide a programmable interface for accessing the data stored in the cluster.
2.3.1 Hadoop Technology Stack
YARN (Yet Another Resource Negotiator): the resource-management layer introduced in Hadoop 2, often described as MapReduce version 2. It is a rewrite of MapReduce 1 and was still new at the time of the surveyed work [3].
2.4 Components Of Hadoop
Avro: A serialization system for efficient, cross-language RPC and persistent data storage.
Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL for querying the data.
HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries.
ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop: A tool for efficient bulk transfer of data between structured data stores and HDFS.
Oozie: A service for running and scheduling workflows of Hadoop jobs [5].
Chapter 3
Working Of Hadoop
3.1 Architecture of Hadoop
Figure 3.1: Hadoop Architecture
[18] Hadoop is an open source, Java-based programming framework that supports
the processing and storage of extremely large data sets in a distributed computing envi-
ronment. Hadoop is used to develop applications that could perform complete statistical
analysis on huge amounts of data.
Hadoop architecture includes following four modules:
Hadoop Common: The Java libraries and utilities required by other Hadoop modules. This library provides OS-level abstractions and contains the necessary Java files and scripts required to start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster resource man-
agement.
MapReduce: The software framework that performs the data processing; it scales and runs an application across different cluster machines.
Hadoop Distributed File System (HDFS): The Hadoop Distributed File System
(HDFS) is a distributed file system designed to run on commodity hardware. It has
many similarities with existing distributed file systems.
3.1.1 Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications having large datasets. Hadoop provides a command interface to interact with HDFS [1].
Figure 3.2: HDFS Architecture
For distributed storage and distributed computation, Hadoop uses a master/slave architecture. The distributed storage system in Hadoop is called the Hadoop Distributed File System, or HDFS. A client interacts with HDFS by communicating with the NameNode and DataNodes; the assignment of the NameNode and DataNodes is transparent to the user, i.e., the user need not know which nodes are or will be assigned. HDFS follows the master-slave architecture and has the following elements.
1. NAME NODE
The name node is commodity hardware that contains the GNU/Linux operating system and the name node software; the software can be run on ordinary commodity hardware. The system hosting the name node acts as the master server and does the following tasks: it manages the file system namespace, regulates clients' access to files, and executes file system operations such as renaming, closing, and opening files and directories [1].
2. DATA NODE
The data node is commodity hardware having the GNU/Linux operating system and the data node software. For every node (commodity hardware/system) in a cluster, there will be a data node. These nodes manage the data storage of their system. Data nodes perform read-write operations on the file systems, as per client request. They also perform operations such as block creation, deletion, and replication according to the instructions of the name node [1].
3. BLOCK
Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB (128 MB in newer Hadoop 2.x releases), and it can be changed as per need in the HDFS configuration [1].
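As a brief sketch of how these pieces fit together from a client's point of view, the following Java fragment uses the standard org.apache.hadoop.fs API to write and then read a file in HDFS (the path and contents are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml, hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);     // the client talks to the NameNode

    Path file = new Path("/user/hadoop/demo.txt"); // hypothetical HDFS path

    // Write: the file is split into blocks, each replicated across DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read: the NameNode supplies block locations; bytes come from DataNodes.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}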
3.2 Hadoop Operation Modes
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three supported modes:
Local/Standalone Mode: After downloading Hadoop onto your system, by default it is configured in standalone mode and can be run as a single Java process.
Pseudo Distributed Mode: A distributed simulation on a single machine. Each Hadoop daemon, such as HDFS, YARN, and MapReduce, runs as a separate Java process. This mode is useful for development.
Fully Distributed Mode: This mode is fully distributed, with a minimum of two or more machines forming a cluster.
3.3 Installation And Setup Of Hadoop
Before installing Hadoop into the Linux environment, we need to set up Linux using ssh (Secure Shell) [17]. Follow the steps given below for setting up the Linux environment.
1. Creating a User
Open the Linux terminal and type the following commands to create a user.
$ su
password:
# useradd hadoop
# passwd hadoop
New passwd:
Retype new passwd
2. SSH Setup And Key Generation
The following commands are used for generating a key-value pair using SSH: copy the public key from id_rsa.pub to authorized_keys, and give the owner read and write permissions on the authorized_keys file.
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
3. Installing Java
First of all, you should verify the existence of Java on your system using the command "java -version". If everything is in order, it will give you output like the following.
$ java -version
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
4. Downloading Hadoop
Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands.
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1/* hadoop/
# exit
3.3.1 Installing Hadoop in Standalone Mode
Step 1: Setting Up Hadoop
You can set Hadoop environment variables by appending the following command to the ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the following command:
$ hadoop version
If everything is fine with your setup, then you should see the following result:
Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
It means your Hadoop standalone mode setup is working fine. By default, Hadoop is configured to run in a non-distributed mode on a single machine.
3.3.2 Installing Hadoop in Pseudo Distributed Mode
Step 1: Setting Up Hadoop
You can set Hadoop environment variables by appending the following commands to the ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
Now apply all the changes to the currently running system.
$ source ~/.bashrc
Step 2: Hadoop Configuration
You can find all the Hadoop configuration files in the location $HADOOP_HOME/etc/hadoop. You are required to make changes in those configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in Java, you have to reset the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
The following is the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop instance, memory allocated for the file system, the memory limit for storing the data, and the size of Read/Write buffers.
Open core-site.xml and add the following properties in between the <configuration>, </configuration> tags.
Figure 3.3: core-site.xml
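The figure shows the file's contents; a typical minimal core-site.xml for a single-node setup, as in the standard Hadoop 2.x setup guides, looks like the following (the port is an assumption you may adapt):

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>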
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of the replication data, the namenode path, and the datanode paths of your local file systems, i.e., the place where you want to store the Hadoop infrastructure.
Figure 3.4: hdfs-site.xml
Open this file and add the following properties in between the <configuration>, </configuration> tags in this file.
Figure 3.5: hdfs-site.xml properties
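A typical minimal hdfs-site.xml, matching the namenode directory that appears in the format log later in this chapter (the local paths are assumptions you may adapt), looks like:

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>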
Note: In the above file, all the property values are user-defined, and you can make changes according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.
Figure 3.6: yarn-site.xml
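For reference, the standard single-node yarn-site.xml property, as in the usual Hadoop 2.x setup guides, is:

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>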
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains only a template of this file, so it is first required to copy it from mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open the mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.
Figure 3.7: mapred-site.xml
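For reference, the standard property that points MapReduce at YARN, as in the usual Hadoop 2.x setup guides, is:

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>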
3.3.3 Verifying Hadoop Installation
The following steps are used to verify the Hadoop installation.
Step 1: Name Node Setup
Set up the namenode using the command "hdfs namenode -format" as follows.
$ cd
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/
Step 2: Verifying Hadoop dfs
The following command is used to start dfs. Executing this command will start your Hadoop
file system.
$ start-dfs.sh
The expected output is as follows:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to
/home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to
/home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
Step 3: Verifying Yarn Script
The following command is used to start the yarn script. Executing this command will start
your yarn daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to
/home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to
/home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
Step 4: Accessing Hadoop on Browser
The default port number to access Hadoop is 50070. Use the following URL to get
Hadoop services in the browser.
http://localhost:50070/
Figure 3.8: Hadoop on Browser
Step 5: Verify All Applications for the Cluster
The default port number to access all applications of the cluster is 8088. Use the following URL to visit this service.
http://localhost:8088/
Figure 3.9: Hadoop application on Cluster
3.4 Some Essential Hadoop Projects
Pig Latin: a programming language.
Pig Runtime: compiles Pig Latin and converts it into MapReduce jobs to submit to the cluster.
Data Intelligence: we also have data intelligence in the form of Drill and Mahout.
Drill: Drill is an incubator project and is designed to do interactive analysis on nested data.
Mahout: Mahout is a machine learning library that covers the three Cs:
1. Recommendation (collaborative filtering)
2. Clustering (a way to group related text and documents)
3. Classification (a way to categorize related text and documents) [3].
Chapter 4
Applications Of Big Data Analytics
Using Hadoop
4.1 Applications Of Big Data Analytics Using Hadoop
Health Care Applications
IoT
Social Media
Advertisement (mining user behavior to generate recommendations)
Searches (grouping related documents)
4.2 Health Care Applications
Hadoop technology in Monitoring Patient Vitals
Hadoop technology in Cancer Treatments and Genomics
Hadoop technology in the Hospital Network
Hadoop technology in Healthcare Intelligence
Hadoop technology in Fraud Prevention and Detection [17].
4.2.1 Hadoop Technology In Monitoring Patient Vitals
There are several hospitals across the world that use Hadoop to help their staff work efficiently with Big Data. Without Hadoop, most patient care systems could not even imagine working with unstructured data for analysis.
Figure 4.1: Hadoop Technology In Monitoring Patient Vitals
Children's Healthcare of Atlanta treats over 6,200 children in their ICU units. On average, the duration of stay in the Pediatric ICU varies from a month to a year. Children's Healthcare of Atlanta used a sensor beside each bed to continuously track patient vital signs such as blood pressure, heartbeat, and respiratory rate. These sensors produce large chunks of data, which legacy systems could not store for more than three days of analysis. The main motive of Children's Healthcare of Atlanta was to store and analyze the vital signs, and if there was any change in pattern, to generate an alert to a team of doctors and assistants. All this was successfully achieved using Hadoop ecosystem components: Hive, Flume, Sqoop, Spark, and Impala [17].
Chapter 5
Advantages/Disadvantages of Hadoop
5.1 Advantages Of Hadoop
Scalable: Hadoop is a highly scalable storage platform, because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel.
Flexible: Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data.
Cost effective: Hadoop also offers a cost-effective storage solution for businesses' exploding data sets.
Resilient to failure: A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated to other nodes in the cluster, which means that in the event of a failure, there is another copy available for use.
Fast: The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing. If you're dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours [12].
5.2 Disadvantages Of Hadoop
Security concerns: Just managing a complex application such as Hadoop can be challenging. A simple example can be seen in the Hadoop security model, which is disabled by default due to sheer complexity. If whoever manages the platform does not know how to enable it, your data could be at huge risk. Hadoop is also missing encryption at the storage and network levels, which is a major selling point for government agencies and others that prefer to keep their data under wraps [12].
Not fit for small data: Due to its high-capacity design, the Hadoop Distributed File System lacks the ability to efficiently support the random reading of small files. As a result, it is not recommended for organizations with small quantities of data [12].
Vulnerable by nature: Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The framework is written almost entirely in Java, one of the most widely used yet controversial programming languages in existence [12].
Hadoop can perform only batch processing with sequential access, which is time-consuming, so a new technique is needed to get rid of this problem.
Disk-based processing is slow.
Many additional tools are needed to enhance Hadoop's capabilities.
Not suited for interactive and iterative workloads.
Multiple copies of already big data.
Lack of required skills.
Very limited SQL support.
Chapter 6
Conclusion
6.1 Conclusion
We studied the concept of Big Data along with its 5 Vs: volume, velocity, variety, value, and veracity. Today, data is generated in large amounts, and processing these large amounts of data is a big issue. This report describes in detail Hadoop, an open source software framework used for processing Big Data. Apache Hadoop is open source and pioneered a fundamentally new way of storing and processing data. Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it can scale without limits. Hadoop's breakthrough advantages mean that businesses and organizations can now find value in data that was recently considered useless.
We also discussed some Hadoop components which are used to support the processing of large data sets in distributed computing environments.
Chapter 7
Bibliography
[1] Sethy, Rotsnarani, and Mrutyunjaya Panda. "Big Data Analysis using Hadoop: A Survey." International Journal 5.7 (2015).
[2] Bhosale, Harshawardhan S., and Devendra P. Gadekar. ”A Review Paper on Big
Data and Hadoop.” International Journal of Scientific and Research Publications
4.10 (2014): 1.
[3] http://research.ijcaonline.org/volume108/number12/pxc3900288.pdf
[4] https://en.wikipedia.org/wiki/Big_data
[5] Tom White. Hadoop: The Definitive Guide. O'Reilly, 3rd Edition.
[6] https://www.google.co.in/?gfe_rd=cr&ei=ayKnWJWmDe_x8AfDyLnQDg&gws_rd=ssl#q=hadoop+tutorial+ppt
[7] https://www.google.co.in/?gfe_rd=cr&ei=ayKnWJWmDe_x8AfDyLnQDg&gws_rd=ssl#q=hadoop
[8] Bernice Purcell. "The emergence of 'big data' technology and analytics." Journal of Technology Research (2013).
[9] https://www.google.co.in/search?q=Hadoop%2C+a+distributed+framework+for+Big+Data&ie=utf-8&oe=utf-8&client=firefox-b-ab&gfe_rd=cr&ei=glXJWJyDMIKM4gL89IPACg
[10] Gupta, Bhawna, and Kiran Jyoti. "Big data analytics with hadoop to analyze targeted attacks on enterprise data." (IJCSIT) International Journal of Computer Science and Information Technologies 5.3 (2014): 3867-3870.
[11] Russom, Philip. "Big data analytics." TDWI Best Practices Report, Fourth Quarter (2011): 1-35.
[12] http://blogs.mindsmapped.com/bigdatahadoop/hadoop-advantages-and-disadvantages/
[13] http://www.tutorialspoint.com/articles/what_is_nosql_and_is_it_the_next_big_trend_in_databases
[14] http://www.tutorialspoint.com/MongoDB/MongoDB-Application.htm
[16] http://www.w3resource.com/mongodb/nosql.php
[17] https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85
[18] https://www.tutorialspoint.com/hadoop/hadoop.htm