What is Hadoop?
Apache Hadoop is open-source software that was born out of the need to process an avalanche (a sudden, overwhelming arrival of something) of big data.
The web was generating more and more information every day, and it was becoming very difficult to index over one billion pages of content. Hadoop provides storage and large-scale processing of data.
Basically, it is a way of storing enormous data sets across distributed clusters of servers and then running "distributed" analysis applications on each cluster.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It is named after Cutting's son's yellow toy elephant, which was called Hadoop.
Normal DB – stores data and retrieves it using SQL queries
Hadoop – stores and retrieves data, but no SQL queries are involved
Two main parts – a distributed file system for data storage and a data processing framework.
Distributed file-system – HDFS
Data processing framework – MapReduce
HDFS follows a write-once, read-many model: data is written to the cluster once and then read and reused many times.
Namenode – master
Datanode – slaves
Input data is split into blocks of equal size (128 MB by default) and stored on the Datanodes. The Namenode maintains the mapping of blocks to Datanodes.
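The block arithmetic above can be sketched in a few lines. This is an illustration of the idea only (the function name and file sizes are made up, not a real HDFS API):

```python
# Sketch: how HDFS divides a file into fixed-size blocks (128 MB by default).
# num_blocks is a hypothetical helper, not part of any Hadoop library.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def num_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks needed to store a file (the last block may be partial)."""
    return -(-file_size_bytes // BLOCK_SIZE)  # ceiling division

# A 1 GB file fills exactly 8 blocks; a 200 MB file needs 2 (the second one partial).
print(num_blocks(1024 * 1024 * 1024))  # 8
print(num_blocks(200 * 1024 * 1024))   # 2
```

Each of those blocks is then replicated across Datanodes, and the Namenode records where every block lives.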
MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
Map – takes a set of input data and converts it into another set of data tuples (key/value pairs).
Reduce – takes the output of the map phase as input, combines those tuples, and produces the final output.
Job tracker – (master) splits the job submitted by the client into small sub-tasks; it handles job scheduling and monitors the task trackers.
Task tracker – (slave) actually runs the map and reduce tasks in parallel, in a distributed manner, on the data stored in the Datanodes.
Instead of writing MapReduce programs directly, we can use querying tools such as Apache Pig and Apache Hive, which give analysts considerable power and flexibility.
Example of MapReduce: find the maximum temperature of each city.
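The map, shuffle, and reduce phases for that example can be simulated in plain Python. This is a single-process sketch of the paradigm, not real Hadoop code; the cities and temperatures are invented sample data:

```python
from collections import defaultdict

# Hypothetical input: (city, temperature) records, as they might be parsed
# from lines of a weather log spread across HDFS blocks.
records = [
    ("Delhi", 41), ("Mumbai", 33), ("Delhi", 45),
    ("Chennai", 38), ("Mumbai", 36), ("Chennai", 40),
]

# Map phase: emit (key, value) tuples. Here each record already is a
# (city, temperature) pair, so the mapper passes it through unchanged.
mapped = [(city, temp) for city, temp in records]

# Shuffle phase: the framework groups all values by key between
# the map and reduce phases.
grouped = defaultdict(list)
for city, temp in mapped:
    grouped[city].append(temp)

# Reduce phase: combine the values for each key into the final output,
# the maximum temperature per city.
max_temps = {city: max(temps) for city, temps in grouped.items()}

print(max_temps)  # {'Delhi': 45, 'Mumbai': 36, 'Chennai': 40}
```

In a real cluster, many mappers and reducers would run this logic in parallel on different Datanodes, which is where the scalability comes from.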
YARN : used in Hadoop V2 (an improvement over classic MapReduce).
Apache YARN – "Yet Another Resource Negotiator" – is the resource management layer of Hadoop V2.
YARN allows different data processing engines (graph processing, interactive processing, stream processing, as well as batch processing) to run and process data stored in HDFS simultaneously with MapReduce.
YARN Architecture :
Resource Manager – (Master of YARN) (Similar to Job tracker)
RM manages the global assignments of resources (CPU and memory) among all the applications. It arbitrates system resources between competing applications.
- Scheduler
The scheduler is responsible for allocating resources to the running applications.
- Application Manager
It manages running applications in the cluster, i.e., it is responsible for starting ApplicationMasters and for monitoring and restarting them on different nodes in case of failure.
Node Manager – (Slave of YARN) (Similar to Task tracker)
The NM is responsible for containers, monitoring their resource usage, and reporting it to the Resource Manager. It also manages the user processes on that machine.
Application Master – one runs per application. It negotiates resources from the Resource Manager and works with the Node Managers. It manages the resource needs of its application.
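The division of labour above can be made concrete with a toy sketch: a Resource Manager granting container requests against the memory that each Node Manager reports. All class names, node names, and memory figures here are hypothetical, and the first-fit policy stands in for YARN's real pluggable schedulers (Capacity Scheduler, Fair Scheduler):

```python
class NodeManager:
    """Tracks the memory (in MB) still free on one worker node."""
    def __init__(self, name: str, memory_mb: int):
        self.name = name
        self.free_mb = memory_mb

class ResourceManager:
    """Grants container requests against whichever node has room (first fit)."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb: int):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return node.name  # container granted on this node
        return None  # no node has room; the request must wait

rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
print(rm.allocate(3072))  # node1 (4096 MB free)
print(rm.allocate(2048))  # node2 (node1 has only 1024 MB left)
print(rm.allocate(2048))  # None  (no node has 2048 MB free now)
```

In real YARN, it is the Application Master that sends such container requests to the Resource Manager, and the Node Managers then launch and monitor the granted containers.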
Advantages of Hadoop :
1. Resilient to failure – HDFS replicates each block of data on multiple machines in the cluster (three by default). If any machine fails or slows down, the data can be retrieved from another replica.
2. Speedy output – because tasks run in parallel across many nodes, large data sets can be processed far faster than on a single machine.
3. Cost effective – big data can be stored and processed at low cost, since Hadoop runs on clusters of commodity hardware.
4. Scalable – the cluster can grow to hundreds or thousands of servers, with data stored and processed across all of them.
Limitations of Hadoop :
1. Missing encryption – data is not encrypted by default, so it can be read by anyone with access to the cluster.
2. Written in Java – Java has been heavily exploited by cybercriminals.
3. Lacks the ability to efficiently support small files.
Companies currently using Hadoop :