The Components of Apache Hadoop

A technical description of the projects that comprise Hadoop

What is Hadoop?
Following my high-level write-up of Hadoop and Big Data, this article will present each of the components or projects that make up Hadoop with a technical description of each.

First, what is Hadoop?

Hadoop stores and processes large volumes of a wide variety of data that changes rapidly, then analyses and summarizes that data. Typical uses include a city census, web page analytics, threat analysis, risk models and network failure analysis.

Hadoop is redundant and reliable, powerful and focused on batch processing.

Hadoop divides a large data processing job into many smaller tasks that can be distributed across all the nodes of a cluster.

Hadoop comprises two main components:

  • MapReduce: The task to analyse the data and summarize the results
  • HDFS: The distributed file system, on commodity server hardware, that contains the data.
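To make the idea of splitting one job into many small tasks concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop at all; the record data and the function names (split_into_chunks, map_task, shuffle, reduce_task, run_job) are made up for illustration, and on a real cluster the map tasks would run in parallel on different nodes.

```python
from collections import defaultdict

# Hypothetical input: one record per household, as (city, household_members).
records = [
    ("Springfield", 4),
    ("Shelbyville", 2),
    ("Springfield", 3),
    ("Shelbyville", 5),
]

def split_into_chunks(data, num_chunks):
    """Divide the job's input into smaller pieces, one per map task."""
    return [data[i::num_chunks] for i in range(num_chunks)]

def map_task(chunk):
    """Map: emit a (key, value) pair for every record in this chunk."""
    return [(city, members) for city, members in chunk]

def shuffle(mapped_outputs):
    """Group all emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped

def reduce_task(key, values):
    """Reduce: summarize all the values collected for one key."""
    return key, sum(values)

def run_job(data, num_chunks=2):
    chunks = split_into_chunks(data, num_chunks)
    mapped = [map_task(chunk) for chunk in chunks]  # parallel on a real cluster
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())

population = run_job(records)
print(population)
```

The result does not depend on how many chunks the input is split into, which is what lets the framework spread the work over as many nodes as are available.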

On each server there is a TaskTracker and a DataNode, and a JobTracker coordinates the jobs across them:

DataNode
The data node stores the data in HDFS and keeps track of access to the data.

TaskTracker
The TaskTracker launches the tasks of a MapReduce job on its node and manages the many tasks within that job. So if my project were to conduct a census count, a TaskTracker might count the members of a household on a data node. When finished, the TaskTracker reports its status to the JobTracker. (Note: as of this writing, May 2013, TaskTracker is being obsoleted and replaced by YARN in MapReduce v2.)

JobTracker
The JobTracker keeps track of all the jobs being executed and tries to schedule each map task as close as possible to the data being processed. If a task has failed or disappeared, perhaps due to hardware failure, the JobTracker will assign that task to another node.
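The two behaviours described here, scheduling work near the data and reassigning work after a failure, can be sketched as a toy simulation. This is not the JobTracker's actual algorithm; the node names, block names and the functions pick_node, schedule and reassign_failed are all hypothetical.

```python
# Hypothetical cluster state: which nodes hold a copy of each data block.
block_locations = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-c", "node-a"],
}

def pick_node(block, live_nodes, load):
    """Prefer a live node that already stores the block, else any live node."""
    local = [n for n in block_locations[block] if n in live_nodes]
    candidates = local or sorted(live_nodes)
    # Among the candidates, choose the least-loaded node (name breaks ties).
    return min(candidates, key=lambda n: (load[n], n))

def schedule(blocks, live_nodes):
    """Assign one task per block, keeping each task close to its data."""
    load = {n: 0 for n in live_nodes}
    assignment = {}
    for block in blocks:
        node = pick_node(block, live_nodes, load)
        assignment[block] = node
        load[node] += 1
    return assignment

def reassign_failed(assignment, failed_node, live_nodes):
    """Return a new plan with the failed node's tasks moved to live nodes."""
    survivors = set(live_nodes) - {failed_node}
    load = {n: 0 for n in survivors}
    for node in assignment.values():
        if node in survivors:
            load[node] += 1
    updated = dict(assignment)
    for block, node in assignment.items():
        if node == failed_node:
            new_node = pick_node(block, survivors, load)
            updated[block] = new_node
            load[new_node] += 1
    return updated

nodes = {"node-a", "node-b", "node-c"}
plan = schedule(sorted(block_locations), nodes)
recovered = reassign_failed(plan, "node-b", nodes)
print(plan)
print(recovered)
```

In this sketch the task for block-2 moves from node-b to node-c after the failure, because node-c holds the other copy of that block.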

So, now that I know what a task and a job are, how do I write tasks? How does a user create a MapReduce job? There are various projects that make it easy. (As to how the projects were named, don't ask me!)

Apache Pig
To write a computer program, a software engineer might use a compiler, such as a C compiler, that takes pseudo-English instructions (IF, THEN, FOR, ELSE) and creates machine code that a computer can execute. Similarly, Apache Pig provides a high-level language, Pig Latin, that expresses data processing steps and translates them into MapReduce jobs that run as Java code. Pig's primary feature is that its programs can be run in parallel, meaning many MapReduce tasks can run simultaneously to allow linear scaling and efficiency.

Apache Hive
Hive provides a SQL-like language, HiveQL, which allows you to define a computation in SQL-like terms and then translates it into MapReduce jobs. Hive also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inefficient to express their logic in HiveQL.
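To illustrate why a SQL-like query can stand in for a hand-written MapReduce job, the sketch below computes the same aggregate twice. It uses Python's built-in sqlite3 module as a stand-in for the declarative side, since HiveQL itself is not run here; the households table and its rows are made up.

```python
import sqlite3
from collections import defaultdict

# Hypothetical data: (city, household_members).
rows = [("Springfield", 4), ("Shelbyville", 2), ("Springfield", 3)]

# Declarative version: the kind of GROUP BY query a HiveQL user would write.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE households (city TEXT, members INTEGER)")
conn.executemany("INSERT INTO households VALUES (?, ?)", rows)
sql_result = dict(
    conn.execute(
        "SELECT city, SUM(members) FROM households GROUP BY city"
    ).fetchall()
)
conn.close()

# Hand-written version: the map and reduce steps the query stands for.
grouped = defaultdict(list)
for city, members in rows:          # map: emit (city, members)
    grouped[city].append(members)   # shuffle: collect values per key
mr_result = {city: sum(values) for city, values in grouped.items()}  # reduce

print(sql_result == mr_result)
```

The GROUP BY column plays the role of the map key and the aggregate function plays the role of the reducer, which is the correspondence a SQL-to-MapReduce translator relies on.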

HBase
HBase is a simple interface to distributed data that allows incremental processing. HBase stores its information in HDFS and its metadata in ZooKeeper.

HCatalog
HCatalog is an abstraction layer for referencing data without using the underlying filenames or formats. It insulates users and scripts from how and where the data is physically stored.
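The idea of referring to data by a table name rather than by file path or format can be sketched in a few lines of plain Python. This is only an illustration of the abstraction, not HCatalog's API; the catalog dictionary and the register_table and read_table functions are hypothetical.

```python
import csv
import json
import os
import tempfile

# Hypothetical catalog: table name -> where and how the data is stored.
catalog = {}

def register_table(name, path, fmt):
    """Record (or update) the physical location and format of a table."""
    catalog[name] = {"path": path, "format": fmt}

def read_table(name):
    """Return rows as dicts; the caller never sees the path or the format."""
    entry = catalog[name]
    with open(entry["path"], newline="") as f:
        if entry["format"] == "csv":
            return list(csv.DictReader(f))
        if entry["format"] == "jsonl":
            return [json.loads(line) for line in f if line.strip()]
    raise ValueError(f"unsupported format: {entry['format']}")

# Write the same record in two different formats to temporary files.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "households.csv")
with open(csv_path, "w", newline="") as f:
    f.write("city,members\nSpringfield,4\n")
jsonl_path = os.path.join(workdir, "households.jsonl")
with open(jsonl_path, "w") as f:
    f.write(json.dumps({"city": "Springfield", "members": "4"}) + "\n")

register_table("households", csv_path, "csv")
rows_from_csv = read_table("households")

# Moving the data to a new file and format only changes the catalog entry.
register_table("households", jsonl_path, "jsonl")
rows_from_jsonl = read_table("households")

print(rows_from_csv == rows_from_jsonl)
```

A script that calls read_table("households") keeps working after the data is moved or converted, which is the insulation described above.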

Some of the smaller projects

Mahout
Mahout is a machine learning library for writing MapReduce applications focused on machine learning.

Ambari, Ganglia and Nagios
These projects help you monitor and understand what goes on in your cluster.

Sqoop
Sqoop is a tool that uses MapReduce jobs to transfer data between Hadoop and SQL databases.

Oozie
Oozie is a workflow scheduler that triggers MapReduce jobs and executes them automatically, either on a schedule or when new data becomes available.

Flume
Flume streams inputs into Hadoop and loads that data into HDFS.

Here is a graphical view of the components

Hadoop components, courtesy of Hortonworks

More Stories By Jonathan Gershater

Jonathan Gershater has lived and worked in Silicon Valley since 1996, primarily doing system and sales engineering, specializing in Web Applications, Identity and Security. At Red Hat, he provides Technical Marketing for Virtualization and Cloud. Prior to joining Red Hat, Jonathan worked at 3Com, Entrust (by acquisition), two startups, Sun Microsystems and Trend Micro.

(The views expressed in this blog are entirely mine and do not represent my employer - Jonathan).