Welcome!

Apache Authors: Pat Romanski, Liz McMillan, Elizabeth White, Christopher Harrold, Janakiram MSV

Related Topics: @BigDataExpo, Java IoT, Open Source Cloud, @CloudExpo, Apache, SDN Journal

@BigDataExpo: Article

The Components of Apache Hadoop

A technical description of the projects that comprise Hadoop

What is Hadoop?
Following my high-level write-up of Hadoop and Big Data, this article will present each of the components or projects that make up Hadoop with a technical description of each.

First, what is Hadoop?

Hadoop stores and processes large volumes of a wide variety of data that changes rapidly. It analyses and summarizes the data. For example: census of a city, web page analytics, threat analysis, risk models, network failures, etc.

Hadoop is redundant and reliable, powerful and focused on batch processing.

Hadoop divides a large data processing job into many smaller tasks that can be distributed across all the nodes

Hadoop comprises two main components:

  • MapReduce: The task to analyse the data and summarize the results
  • HDFS: The distributed file system, on commodity server hardware, that contains the data.

On each server there is a task tracker and a data node:

DataNode
The data node stores the data in HDFS and keeps track of access to the data.

TaskTracker
Task tracker launches a map reduce job on a node and manages the many tasks within one MapReduce job. So if my project was to conduct a census count, task tracker may count the members of a household on a data node. When finshed, task tracker reports its status to the job tracker. (Note: as of this writing, May 2013, TaskTracker is being obsoleted and replaced by "Yarn" in MapReduce v2.

JobTracker
Job tracker keeps track of all the jobs being executed and tries to schedule each map job as close to the actual data being processed. If a task has failed or disappeared perhaps due to hardware failure, job tracker will assign that task to another node.

So, now that I know what is a task and job how do I write tasks? How does a user create a map reduce job? There are various projects that make it easy. (As to how the projects were named, don't ask me!)

Apache Pig
To write a computer program, a software engineer might use a compiler, like "C",  that compiles 'pseudo english instructions (IF, THEN, FOR, ELSE) and creates machine code that a computer an execute. Similarly, Apache Pig is a high level language that expresses data map reduce jobs and translates them to JAVA computer language. Pig's primary feature is that it can be run in parallel, meaning many map reduce jobs can run simultaneously to allow linear scaling and efficiency.

Apache Hive
Hive is a SQL like language, HiveQL, which allows you to define computation in SQL like language and then and translate it down into map reduce JAVA code. Hive also allows traditional MapRedce programmers to plug in their custom MapReducers when it is  inefficient to express their logic in HiveQL.

hBase
hBase is a simple interface to distributed data that allows incremental processing. hBase stores its information in HDFS and metadata in zookeeper.

hCatalog
hCatalog is an abstraction layer for referencing data without using the underlying file­names or formats. It insulates users and scripts from how and where the data is physically stored.

Some of the smaller projects

Mahout
Mahout is a machine learning library to write MapReduce applications focused on machine learning

Ambari, Gagli and Nagios
These projects help you understand what goes on in your cluster

Scoop
Scoop is a tool that lets you run map reduce applications to or from sql databases

Oozie
Oozie is a workflow that triggers MapReduce jobs and executes them automatically or launches when new data becomes available.

Flume
Streams inputs into hadoop and gets that data loaded into hdfs

Here is a graphical view of the components

Hadoop components, courtesy of Hortonworks

(courtesy of Hortonworks)

More Stories By Jonathan Gershater

Jonathan Gershater has lived and worked in Silicon Valley since 1996, primarily doing system and sales engineering specializing in: Web Applications, Identity and Security. At Red Hat, he provides Technical Marketing for Virtualization and Cloud. Prior to joining Red Hat, Jonathan worked at 3Com, Entrust (by acquisition) two startups, Sun Microsystems and Trend Micro.

(The views expressed in this blog are entirely mine and do not represent my employer - Jonathan).

@ThingsExpo Stories
SYS-CON Events announced today that App2Cloud will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct. 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. App2Cloud is an online Platform, specializing in migrating legacy applications to any Cloud Providers (AWS, Azure, Google Cloud).
IoT is at the core or many Digital Transformation initiatives with the goal of re-inventing a company's business model. We all agree that collecting relevant IoT data will result in massive amounts of data needing to be stored. However, with the rapid development of IoT devices and ongoing business model transformation, we are not able to predict the volume and growth of IoT data. And with the lack of IoT history, traditional methods of IT and infrastructure planning based on the past do not app...
To get the most out of their data, successful companies are not focusing on queries and data lakes, they are actively integrating analytics into their operations with a data-first application development approach. Real-time adjustments to improve revenues, reduce costs, or mitigate risk rely on applications that minimize latency on a variety of data sources. Jack Norris reviews best practices to show how companies develop, deploy, and dynamically update these applications and how this data-first...
Internet-of-Things discussions can end up either going down the consumer gadget rabbit hole or focused on the sort of data logging that industrial manufacturers have been doing forever. However, in fact, companies today are already using IoT data both to optimize their operational technology and to improve the experience of customer interactions in novel ways. In his session at @ThingsExpo, Gordon Haff, Red Hat Technology Evangelist, shared examples from a wide range of industries – including en...
Intelligent Automation is now one of the key business imperatives for CIOs and CISOs impacting all areas of business today. In his session at 21st Cloud Expo, Brian Boeggeman, VP Alliances & Partnerships at Ayehu, will talk about how business value is created and delivered through intelligent automation to today’s enterprises. The open ecosystem platform approach toward Intelligent Automation that Ayehu delivers to the market is core to enabling the creation of the self-driving enterprise.
"We're a cybersecurity firm that specializes in engineering security solutions both at the software and hardware level. Security cannot be an after-the-fact afterthought, which is what it's become," stated Richard Blech, Chief Executive Officer at Secure Channels, in this SYS-CON.tv interview at @ThingsExpo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
Consumers increasingly expect their electronic "things" to be connected to smart phones, tablets and the Internet. When that thing happens to be a medical device, the risks and benefits of connectivity must be carefully weighed. Once the decision is made that connecting the device is beneficial, medical device manufacturers must design their products to maintain patient safety and prevent compromised personal health information in the face of cybersecurity threats. In his session at @ThingsExpo...
SYS-CON Events announced today that Grape Up will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct. 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Grape Up is a software company specializing in cloud native application development and professional services related to Cloud Foundry PaaS. With five expert teams that operate in various sectors of the market across the U.S. and Europe, Grape Up works with a variety of customers from emergi...
Detecting internal user threats in the Big Data eco-system is challenging and cumbersome. Many organizations monitor internal usage of the Big Data eco-system using a set of alerts. This is not a scalable process given the increase in the number of alerts with the accelerating growth in data volume and user base. Organizations are increasingly leveraging machine learning to monitor only those data elements that are sensitive and critical, autonomously establish monitoring policies, and to detect...
SYS-CON Events announced today that Massive Networks will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Massive Networks mission is simple. To help your business operate seamlessly with fast, reliable, and secure internet and network solutions. Improve your customer's experience with outstanding connections to your cloud.
Everything run by electricity will eventually be connected to the Internet. Get ahead of the Internet of Things revolution and join Akvelon expert and IoT industry leader, Sergey Grebnov, in his session at @ThingsExpo, for an educational dive into the world of managing your home, workplace and all the devices they contain with the power of machine-based AI and intelligent Bot services for a completely streamlined experience.
Because IoT devices are deployed in mission-critical environments more than ever before, it’s increasingly imperative they be truly smart. IoT sensors simply stockpiling data isn’t useful. IoT must be artificially and naturally intelligent in order to provide more value In his session at @ThingsExpo, John Crupi, Vice President and Engineering System Architect at Greenwave Systems, will discuss how IoT artificial intelligence (AI) can be carried out via edge analytics and machine learning techn...
With tough new regulations coming to Europe on data privacy in May 2018, Calligo will explain why in reality the effect is global and transforms how you consider critical data. EU GDPR fundamentally rewrites the rules for cloud, Big Data and IoT. In his session at 21st Cloud Expo, Adam Ryan, Vice President and General Manager EMEA at Calligo, will examine the regulations and provide insight on how it affects technology, challenges the established rules and will usher in new levels of diligence a...
In the enterprise today, connected IoT devices are everywhere – both inside and outside corporate environments. The need to identify, manage, control and secure a quickly growing web of connections and outside devices is making the already challenging task of security even more important, and onerous. In his session at @ThingsExpo, Rich Boyer, CISO and Chief Architect for Security at NTT i3, discussed new ways of thinking and the approaches needed to address the emerging challenges of security i...
SYS-CON Events announced today that Datera, that offers a radically new data management architecture, has been named "Exhibitor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Datera is transforming the traditional datacenter model through modern cloud simplicity. The technology industry is at another major inflection point. The rise of mobile, the Internet of Things, data storage and Big...
An increasing number of companies are creating products that combine data with analytical capabilities. Running interactive queries on Big Data requires complex architectures to store and query data effectively, typically involving data streams, an choosing efficient file format/database and multiple independent systems that are tied together through custom-engineered pipelines. In his session at @BigDataExpo at @ThingsExpo, Tomer Levi, a senior software engineer at Intel’s Advanced Analytics ...
SYS-CON Events announced today that Datera will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Datera offers a radically new approach to data management, where innovative software makes data infrastructure invisible, elastic and able to perform at the highest level. It eliminates hardware lock-in and gives IT organizations the choice to source x86 server nodes, with business model option...
With 10 simultaneous tracks, keynotes, general sessions and targeted breakout classes, Cloud Expo and @ThingsExpo are two of the most important technology events of the year. Since its launch over eight years ago, Cloud Expo and @ThingsExpo have presented a rock star faculty as well as showcased hundreds of sponsors and exhibitors! In this blog post, I provide 7 tips on how, as part of our world-class faculty, you can deliver one of the most popular sessions at our events. But before reading the...
SYS-CON Events announced today that CA Technologies has been named "Platinum Sponsor" of SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business - from apparel to energy - is being rewritten by software. From planning to development to management to security, CA creates software that fuels transformation for companies in the applic...
Internet of @ThingsExpo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 21st Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devic...