Saturday, June 6, 2020

Hadoop - What is a Job in Hadoop?

In the field of computer science, a job simply means a piece of program, and the same idea applies to the Hadoop ecosystem as well. Here, a job essentially consists of a program or a set of programs that needs to manipulate a certain piece of data. So, a job is associated with certain parameters, and the first of them is the program itself. In the Hadoop ecosystem, this is typically a MapReduce program written in Java. However, non-Java jobs can also be submitted to the Hadoop cluster, which we shall discuss in detail later.
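To make this concrete, here is a minimal sketch of what such a Java MapReduce program can look like, using the classic word-count example and the standard org.apache.hadoop.mapreduce API. The class names (WordCountMapper, WordCountReducer) are placeholders chosen for this illustration, and the two classes are shown together only for brevity; in a real project each would typically live in its own file or as a static nested class of the driver.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emits (word, 1) for every word found in its input split.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums up the counts emitted for each word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }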

The important parameters that identify or make up a job are therefore the program, the piece of data on which the program has to work, which is specified as the input file path, and the output file or output directory into which the output of the program's execution needs to be stored. So, a job in the Hadoop ecosystem is a combination of all three.

When we speak specifically about a Java MapReduce job, the programs are submitted to the cluster in the form of a jar file, which packages all the classes that constitute the program that needs to execute on a particular set of data.
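As a hedged sketch of how those three parameters come together, the driver class below wires the (hypothetical) WordCountMapper and WordCountReducer from the earlier sketch into a job, points it at an input path and an output directory taken from the command line, and submits it to the cluster. The class names and paths are illustrative only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            // The program: the jar packaging the driver, mapper and reducer classes.
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // The data: input file path and output directory, supplied as arguments.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submit the job to the cluster and wait for it to finish.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, a driver like this would typically be launched with something along the lines of: hadoop jar wordcount.jar WordCountDriver /user/demo/input /user/demo/output (the jar name and paths are examples only).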

Hadoop provides two major features to its users.

The first important feature is a distributed file system, called the Hadoop Distributed File System or HDFS. Once a file is submitted into the Hadoop cluster, it resides on multiple data nodes, with the original file divided into smaller pieces. This service is called the Hadoop Distributed File System.
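As a small illustration of this (a sketch using the standard Hadoop FileSystem Java API; the file path is only an example), the following program asks the name node which data nodes hold the blocks of a file that was stored in HDFS:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster settings (e.g. fs.defaultFS) from the configuration on the classpath.
            FileSystem fs = FileSystem.get(new Configuration());

            // Example path only - replace with a file that actually exists in your HDFS.
            Path file = new Path("/data/sample.txt");
            FileStatus status = fs.getFileStatus(file);

            // Each BlockLocation describes one block of the file and the data nodes holding its replicas.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Block at offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts: " + String.join(", ", block.getHosts()));
            }
        }
    }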

The second most important feature of Hadoop is that it provides a robust parallel processing framework called the MapReduce framework. Once you submit a job into the Hadoop cluster, the program that is supposed to execute on a piece of data runs on multiple machines, wherever the data corresponding to the input file it is supposed to manipulate resides. So, the two core services where Hadoop really comes in handy are the Hadoop Distributed File System and the parallel processing framework called MapReduce.

Saturday, May 30, 2020

Hadoop - Cluster Setup and RAID

A Hadoop cluster mainly resides in steel frames that resemble modern-day cupboards. The only difference between these racks and cupboards is that a rack does not have metal enclosures on all sides like a traditional metal cupboard. Instead, racks are more like open cupboards, with holes on their pillar-like structures onto which the data nodes, which are basically chassis-only machines, can be bolted.

A rack would consist of anywhere between 25 to 50 nodes, and it is not just data nodes that these racks accommodate. They can also accommodate the name node, the secondary name node, or the job tracker.



A rack is simply an open, frame-like structure onto which these chassis can be bolted. All of the machines within a rack communicate through a switch, usually at a speed of around 10 gigabits per second, and multiple racks communicate through a multi-layer or uplink switch, which also performs the function of a router. Usually, the data transfer speed between machines in the same rack is much higher than the data transfer speed between machines across different racks.

Most data nodes would have two hex-core (6-core) or two octa-core (8-core) processors, that is, two CPUs with six or eight cores each, running at anywhere between 2.4 and 3.5 gigahertz. The exact clock speed and the amount of RAM required typically vary according to organizational needs. Usually, the name node and job tracker would have more RAM than the data nodes; a data node typically has anywhere between 50 to 500 GB of high-speed RAM with a frequency somewhere above 2000 megahertz. For storage, the rule of thumb for how much hard drive capacity each data node needs is at least 2 terabytes of hard drive for every single core of CPU. For example, a data node with two octa-core CPUs (16 cores) would need roughly 32 terabytes of hard drive.


The question is: why don't we use RAID (Redundant Array of Inexpensive Disks), which has been around for more than a decade? Why can't it be used in place of a distributed storage system such as HDFS in Hadoop?

The reason is that HDFS clusters do not benefit from using RAID for the data nodes, because the data nodes already handle replication across several nodes, which is a built-in functionality of HDFS. RAID can still be used for the name node and secondary name node disks, simply as a backup against failure or corruption of the metadata. Moreover, even if a disk fails in a Hadoop cluster, HDFS can continue to operate without the failed disk, which is not the case with RAID. For these reasons, RAID is not the preferred choice for data node storage in HDFS.
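To make the contrast concrete: the safeguard that takes the place of RAID here is ordinary HDFS replication, which is tracked per file by the name node. Below is a hedged sketch using the standard FileSystem API; the path is an example, and 3 is simply the common default replication factor (dfs.replication), not something fixed by this post.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/sample.txt");   // example path only

            // Replication is a per-file property kept by the name node; 3 is the usual default.
            short current = fs.getFileStatus(file).getReplication();
            System.out.println("Current replication factor: " + current);

            // Ask HDFS to keep one more replica of this file - no RAID controller involved.
            boolean accepted = fs.setReplication(file, (short) (current + 1));
            System.out.println("Replication change accepted: " + accepted);
        }
    }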

Saturday, May 23, 2020

Hadoop - Master-Slave Architecture


Suppose we have a Hadoop cluster of 1000 nodes. Out of these, let us say 3 nodes would predominantly operate in master mode: the name node, the secondary name node, and the job tracker. The remaining 1000 minus 3, that is, 997 machines, would work in slave mode, and they are the data nodes. We should note that the machines working in slave mode are the data nodes, and the machines working in master mode are the name node, the secondary name node, and the job tracker.


However, they vary slightly in terms of their hardware configurations. For example, the name node, the secondary name node, and the job tracker need not have very large hard drive storage, whereas the data nodes must, because they are the ones that bear the data load: all of the big data storage happens on these data nodes, and they carry the bulk of the data operations.

It is also important to understand that a host, or native, operating system is installed on all of these machines, both on the machines working as data nodes in slave mode and on the machines in master mode. Ninety percent of the time, the native operating system is a Linux-based operating system, and in some cases a Windows-based operating system is installed on these machines.


It is on this native operating system that Hadoop, as a software framework, is installed. We should note that Hadoop is installed on all of these machines: on the name node, the secondary name node, and the job tracker, as well as on each one of the data nodes. The differentiating factor between the machines in master mode and the machines in slave mode is the software configuration applied after installing Hadoop, which enables each of these machines to perform the responsibilities of a name node, a secondary name node, or a job tracker.
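As a hedged illustration of what that software configuration looks like in practice (assuming a recent Apache Hadoop layout; the hostnames below are placeholders), the installation is identical on every machine, and plain configuration files decide who plays which role. For example, core-site.xml on every node points at the machine that is to act as the name node:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode-host:9000</value>
      </property>
    </configuration>

and the workers file (called slaves in older Hadoop releases) on the master simply lists the hostnames of the machines that are to run as data nodes:

    datanode-001
    datanode-002
    datanode-003

The binaries are the same everywhere; entries like these determine which daemons a given machine actually starts.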
