Click here to close now.

Welcome!

Apache Authors: Carmen Gonzalez, Pat Romanski, Bob Gourley, Elizabeth White, Mark R. Hinkle

Related Topics: Cloud Expo, Microservices Journal, Open Source, Virtualization, Web 2.0, Apache

Cloud Expo: Article

The Age of Big Data: How to Gain Competitive Advantage

The Drivers Behind Hadoop Adoption

We have entered the "Age of Big Data" according to a recent New York Times article. This comes as no surprise to most organizations already struggling with the onslaught of data coming from an increasing number of sources and at an increasing rate. The 2011 IDC Digital Universe Study reported that data is growing faster than Moore's Law. This trend points to a paradigm shift in how organizations process data where isolated islands and silos are being replaced by large clusters of commodity servers that keep data and compute resources together.

Another way of looking at this paradigm shift is that the growing volume and velocity of data require a new approach to networked computing. A good example of this change is found at Google. The industry now takes Google's dominance for granted, but when Google launched its beta search engine in 1998, the company was late entering the market. At the time, Yahoo! was dominant; other contenders included infoseek, excite, Lycos, Ask Jeeves and AltaVista (dominating technical searches). Within two years, Google was the dominant search provider. It wasn't until 2003, when Google published a paper on MapReduce, that the world got a glimpse into Google's back-end architecture.

Google's architecture revealed how the company was able to index significantly more data, to get far better results faster, and to achieve these superior results much more efficiently and cost-effectively than all competitors. The shift Google made was to divide complex data analysis tasks into simple subtasks that could be performed in parallel on commodity servers. Separate processes were being used to Map the data, and then Reduce it into interim or final results. This MapReduce framework would eventually become available to organizations through distributions of Apache Hadoop.

A Brief History of Hadoop
After reading Google's paper in 2003, Yahoo engineer Doug Cutting developed a Java-based implementation of MapReduce, and named it after his son's stuffed elephant, Hadoop. In 2006, Hadoop became a subproject of Lucene (a popular text search library) at the Apache Software Foundation (www.apache.org), and became its own top-level Apache project in 2008.

Essentially, Hadoop provides a way to capture, organize, store, search, share, analyze and visualize disparate data sources (structured, semi-structured and unstructured) across a large cluster of commodity computers, and is designed to scale up from dozens to thousands of servers, each offering local computation and storage.

While there are several elements that are now part of Hadoop, two are fundamental to its operation. The first is the Hadoop Distributed File System (HDFS), which serves as the primary storage system. HDFS replicates and distributes the blocks of source data to the compute nodes throughout the cluster of servers to be analyzed by one or more applications. The second is MapReduce, which creates a software framework and a programming model for writing applications capable of processing vast amounts of distributed data in parallel on very large clusters.

The open source nature of Apache Hadoop creates an ecosystem that facilitates constant advancements in its capabilities, performance, reliability and ease of use. These enhancements can be made by any individual or organization-a global community of contributors-and are then either contributed to the basic Apache library or made available in a separate (often free) commercial distribution.

In effect, Hadoop is a complete system or "stack" for data analysis. The stack includes not only the HDFS and MapReduce foundation, but also job management, development tools, schedulers, machine learning libraries, etc.

KISS: Keep It Simple, Scalable
In a paper titled The Unreasonable Effectiveness of Data, the authors (all research directors from Google) make a contrast between the elegant simplicity of physics (with equations like E = mc2) and other disciplines, noting that, "... sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics."

The fact that simple formulas are fully capable of explaining the complex natural world, while remaining elusive in understanding human behavior, is fundamental to why Hadoop is gaining in popularity. The paper notes the frustration of economists, who lack similar simple equations or models, and explores advances being made in fields like natural language processing-a notoriously complex area that has been studied for years with many attempts at artificial intelligence as a means to gain some insight.

The authors found that relatively simple algorithms applied to massive datasets produced stunning results. One example involves scene completion. An algorithm was used to eliminate something in a picture, a car for instance, and then based on a corpus of thousands of pictures, fill in the missing background. The algorithm performed rather poorly until the corpus was increased to millions of photos. With sufficient data, the same, simple algorithm performed extremely well. This need to find patterns and fill in the "missing pieces" in any puzzle is a common theme in many data analytics applications today.

Data analytics also confronts another inherent complexity: the growth in unstructured and semi-structured data. The sources of unstructured data, such as log files, social media, videos, etc., are growing in both their size and importance. But even structured data that goes through a series of changes eventually loses some or all of its structure. Traditional analytic techniques require considerable preprocessing of unstructured and semi-structured data before being able to produce results, and the results can be wrong or misleading if the preprocessing is somehow flawed.

The ability of Hadoop to employ simple algorithms and obtain meaningful results when analyzing unstructured, semi-structured and structured data in its raw form is unprecedented-and currently unparalleled. MapReduce enables data to be analyzed in an incremental fashion (and with parallel processing) without any need to engage in complex data transformations or to otherwise preprocess any data sources, or to create any schemas or aggregate any data in advance. Sometimes the interim results can be quite revealing on their own, and any unexpected results can be used to further fine-tune additional analysis. In fact, Hadoop was designed to accommodate virtually all forms of data directly, thus eliminating the need to engage in extraordinary measures before being able to unlock the value hidden deeply within.

The Price/Performance of Data Analytics
Not only does Hadoop deliver superior data analytics capabilities and results, it does so (as Google found) with an infrastructure that is far more cost-effective than traditional data analysis tools. The reason is that scaling data analytics capabilities has long been subject to the 80/20 rule: Big gains can be achieved with little initial effort (and cost), but the returns diminish as the datasets grow to become Big Data.

In stark contrast, Hadoop can scale linearly, which is the key to both effective and cost-effective data analytics. As datasets grow, traditional data analysis environments scale in an exponential fashion, causing the additional cost required to gain additional insight to eventually become prohibitive. With Hadoop, by contrast, the cluster of commodity (read: inexpensive) servers with direct-attached storage scales linearly with the growth in the number and sizes of datasets.

Hadoop's ability to satisfy these prerequisites well is the reason for its growing popularity in Web-based businesses and data-intensive organizations, as well as at aggressive start-ups. For the former, the need to wrestle with truly Big Data justifies the need for a data analytics environment like Hadoop. For the latter, the lack of anything legacy makes it easy to benefit from Hadoop's advantages.

One major challenge to Hadoop adoption, however, remains its file system. HDFS is an append-only storage that requires data to be batch loaded in a Hadoop cluster and then later exported post-processing for use by other applications that don't support the HDFS API. And Big Data can be difficult and costly to move back and forth in this fashion owing to the inherent duplication of data across the "semantic wall" between the existing and Hadoop infrastructures.

Another barrier to production adoption of Hadoop in larger organizations involves the extraordinary measures required to make the environment dependable. Constant care is needed to ensure that single points of failure (especially in the NameNode and JobTracker) cannot cause catastrophe, and that in the case of data loss, data can be re-loaded into the Hadoop cluster.

Breaking Through the Barriers
These problems with Hadoop are, themselves, becoming part of the past. Open source communities can be quite large, creating a vibrant ecosystem. This is the case with Hadoop, where several companies are now providing commercial distributions based on open source Hadoop.

The growing number of commercial Hadoop distributions available is systematically breaking through the barriers to widespread adoption. In general, these distributions provide enhancements that make Hadoop easier to integrate into the enterprise, as well as more enterprise-class in its operation, performance and reliability. One way of achieving these enhancements is to use existing and standard communications protocols as a foundation to enable more seamless integration between legacy and Hadoop environments.

Such a common foundation facilitates making the paradigm shift in data analytics in virtually any organization. It eliminates the need to throw data back and forth over a "semantic wall" by tearing down that wall. The compatibility afforded also extends beyond the physical infrastructure and into development environments and routine operating procedures, especially those involving data protection, such as snapshots and mirroring. With standards-based file access into the Hadoop cluster, existing applications and tools, and even ordinary browsers are able to access the data directly and in real-time (vs. Hadoop's traditional batch processing.

The End - or Just the Beginning
The data analytics paradigm is changing, and the change presents a real opportunity for established organizations to take full advantage of some new and powerful capabilities without sacrificing any existing ones. Just as Google was able to do, Hadoop makes it possible for any organization to gain a significant competitive edge by taking full advantage of the insight provided by this paradigm shift.

Hadoop is indeed a game-changing technology, and Hadoop is now itself changing with the advent of enterprise-class commercial distributions. By making Hadoop more mission-critical in its operation (potentially with the same or an even lower total cost of ownership), these "next-generation" solutions make beginning the shift to the new data analytics paradigm less risky and more rewarding than ever before.

More Stories By Jack Norris

Jack Norris is vice president, marketing, MapR Technologies. He has over 20 years of enterprise software marketing experience. He leads worldwide marketing for the industry’s most advanced distribution for Hadoop. Jack’s experience ranges from defining new markets for small companies, leading marketing and business development for an early-stage cloud storage software provider, to increasing sales of new products for large public companies. Jack has also held senior executive roles with Brio Technology, SQRIBE, EMC, Rainfinity, and Bain and Company.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@ThingsExpo Stories
SYS-CON Events announced today that Cisco, the worldwide leader in IT that transforms how people connect, communicate and collaborate, has been named “Gold Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Cisco makes amazing things happen by connecting the unconnected. Cisco has shaped the future of the Internet by becoming the worldwide leader in transforming how people connect, communicate and collaborate. Cisco and our partners are building the platform for the Internet of Everything by connecting the...
SYS-CON Events announced today that Liaison Technologies, a leading provider of data management and integration cloud services and solutions, has been named "Silver Sponsor" of SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York, NY. Liaison Technologies is a recognized market leader in providing cloud-enabled data integration and data management solutions to break down complex information barriers, enabling enterprises to make smarter decisions, faster.
SYS-CON Events announced today that Windstream, a leading provider of advanced network and cloud communications, has been named “Silver Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9–11, 2015, at the Javits Center in New York, NY. Windstream (Nasdaq: WIN), a FORTUNE 500 and S&P 500 company, is a leading provider of advanced network communications, including cloud computing and managed services, to businesses nationwide. The company also offers broadband, phone and digital TV services to consumers primarily in rural areas.
SYS-CON Events announced today that Ciqada will exhibit at SYS-CON's @ThingsExpo, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Ciqada™ makes it easy to connect your products to the Internet. By integrating key components - hardware, servers, dashboards, and mobile apps - into an easy-to-use, configurable system, your products can quickly and securely join the internet of things. With remote monitoring, control, and alert messaging capability, you will meet your customers' needs of tomorrow - today! Ciqada. Let your products take flight. For more inform...
SYS-CON Events announced today that GENBAND, a leading developer of real time communications software solutions, has been named “Silver Sponsor” of SYS-CON's WebRTC Summit, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. The GENBAND team will be on hand to demonstrate their newest product, Kandy. Kandy is a communications Platform-as-a-Service (PaaS) that enables companies to seamlessly integrate more human communications into their Web and mobile applications - creating more engaging experiences for their customers and boosting collaboration and productiv...
SYS-CON Events announced today that Dyn, the worldwide leader in Internet Performance, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Dyn is a cloud-based Internet Performance company. Dyn helps companies monitor, control, and optimize online infrastructure for an exceptional end-user experience. Through a world-class network and unrivaled, objective intelligence into Internet conditions, Dyn ensures traffic gets delivered faster, safer, and more reliably than ever.
SYS-CON Events announced today that Open Data Centers (ODC), a carrier-neutral colocation provider, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. Open Data Centers is a carrier-neutral data center operator in New Jersey and New York City offering alternative connectivity options for carriers, service providers and enterprise customers.
SYS-CON Events announced today that BroadSoft, the leading global provider of Unified Communications and Collaboration (UCC) services to operators worldwide, has been named “Gold Sponsor” of SYS-CON's WebRTC Summit, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. BroadSoft is the leading provider of software and services that enable mobile, fixed-line and cable service providers to offer Unified Communications over their Internet Protocol networks. The Company’s core communications platform enables the delivery of a range of enterprise and consumer calling...
SYS-CON Events announced today that On the Avenue Marketing Group, a sales and marketing firm that utilizes events to market and sell products to consumers, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. On the Avenue Marketing Group (OTA) is a sales and marketing firm that utilizes events to market and sell products to consumers. On behalf of our clients, we attend thousands of fairs, festivals, expos, concerts, conferences, and sporting events annually, helping them reach millions of individuals ...
SYS-CON Events announced today that ActiveState, the leading independent Cloud Foundry and Docker-based PaaS provider, has been named “Silver Sponsor” of SYS-CON's DevOps Summit New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. ActiveState believes that enterprises gain a competitive advantage when they are able to quickly create, deploy and efficiently manage software solutions that immediately create business value, but they face many challenges that prevent them from doing so. The Company is uniquely positioned to help address these challenges thro...
SYS-CON Events announced today that Vitria Technology, Inc. will exhibit at SYS-CON’s @ThingsExpo, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Vitria will showcase the company’s new IoT Analytics Platform through live demonstrations at booth #330. Vitria’s IoT Analytics Platform, fully integrated and powered by an operational intelligence engine, enables customers to rapidly build and operationalize advanced analytics to deliver timely business outcomes for use cases across the industrial, enterprise, and consumer segments.
SYS-CON Events announced today that Alert Logic, the leading provider of Security-as-a-Service solutions for the cloud, has been named “Bronze Sponsor” of SYS-CON's 16th International Cloud Expo® and DevOps Summit 2015 New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY, and the 17th International Cloud Expo® and DevOps Summit 2015 Silicon Valley, which will take place November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA.
SYS-CON Events announced today that Akana, formerly SOA Software, has been named “Bronze Sponsor” of SYS-CON's 16th International Cloud Expo® New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. Akana’s comprehensive suite of API Management, API Security, Integrated SOA Governance, and Cloud Integration solutions helps businesses accelerate digital transformation by securely extending their reach across multiple channels – mobile, cloud and Internet of Things. Akana enables enterprises to share data as APIs, connect and integrate applications, drive part...
SYS-CON Events announced today that CommVault has been named “Bronze Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY, and the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. A singular vision – a belief in a better way to address current and future data management needs – guides CommVault in the development of Singular Information Management® solutions for high-performance data protection, universal availability and sim...
SYS-CON Events announced today that SafeLogic has been named “Bag Sponsor” of SYS-CON's 16th International Cloud Expo® New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. SafeLogic provides security products for applications in mobile and server/appliance environments. SafeLogic’s flagship product CryptoComply is a FIPS 140-2 validated cryptographic engine designed to secure data on servers, workstations, appliances, mobile devices, and in the Cloud.
SYS-CON Events announced today that StorPool Storage will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. StorPool is distributed storage software that allows service providers, enterprises and other cloud builders to run data storage on standard x86 servers, instead of using expensive and inefficient storage arrays (SAN).
SYS-CON Events announced today that Site24x7, the cloud infrastructure monitoring service, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Site24x7 is a cloud infrastructure monitoring service that helps monitor the uptime and performance of websites, online applications, servers, mobile websites and custom APIs. The monitoring is done from 50+ locations across the world and from various wireless carriers, thus providing a global perspective of the end-user experience. Site24x7 supports monitoring H...
SYS-CON Events announced today that Intelligent Systems Services will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Established in 1994, Intelligent Systems Services Inc. is located near Washington, DC, with representatives and partners nationwide. ISS’s well-established track record is based on the continuous pursuit of excellence in designing, implementing and supporting nationwide clients’ mission-critical systems. ISS has completed many successful projects in Healthcare, Commercial, Manufacturing, ...
SYS-CON Events announced today that B2Cloud, a provider of enterprise resource planning software, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. B2cloud develops the software you need. They have the ideal tools to help you work with your clients. B2Cloud’s main solutions include AGIS – ERP, CLOHC, AGIS – Invoice, and IZUM
The IoT Bootcamp is coming to Cloud Expo | @ThingsExpo on June 9-10 at the Javits Center in New York. Instructor. Registration is now available at http://iotbootcamp.sys-con.com/ Instructor Janakiram MSV previously taught the famously successful Multi-Cloud Bootcamp at Cloud Expo | @ThingsExpo in November in Santa Clara. Now he is expanding the focus to Janakiram is the founder and CTO of Get Cloud Ready Consulting, a niche Cloud Migration and Cloud Operations firm that recently got acquired by Aditi Technologies. He is a Microsoft Regional Director for Hyderabad, India, and one of the f...