Welcome!

Apache Authors: Pat Romanski, Liz McMillan, Elizabeth White, Janakiram MSV, Gil Allouche

Related Topics: @BigDataExpo, Java IoT, Linux Containers, Containers Expo Blog, @CloudExpo, Apache

@BigDataExpo: Blog Post

The Hadoop Business Case By @JGlesner | @BigDataExpo [#BigData]

Big Data is 'data that exceeds the processing capacity of conventional database systems'

By

Since the birth of Hadoop in 2005-06, the way we think about storing and processing information has evolved considerably. The term “Big Data” has become synonymous with this evolution. But still, many of our customers continue to ask, “What is Big Data?”, “What are its use cases?”, and “What is its business value?”. The Internet is overloaded with definitions, characteristics, and benefits; however, few discussions synthesize all three of these topics in one place. This paper answers these questions, and proposes a total cost calculation framework for CTOs and CIOs that are evaluating solutions for their organization’s use case(s). In the text below, I examine an on-premise Hadoop ecosystem as a general purpose Big Data solution in relation to alternative commercial purpose-built storage technologies (-e.g. Oracle, Teradata, IBM, SAP, Microsoft, EMC, etc). It may be difficult to determine the exact point at which you should leverage one over the other. It is my contention that when the total cost of using all your data exceeds what you are able to spend using purpose-built technologies, it is time to consider using a general purpose solution like Hadoop for process offloading.

What is Big Data?
According to Edd Dumbill, a well respected thought leader and VP of Strategy for Silicon Valley Data Science, a big data and data science consulting company, Big Data is “data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” This definition was published in an article entitled “What is Big Data?” in Big Data Now: 2012 Edition by O’Reilly Media, and touches on the three primary characteristics of data [1]:

  • Volume: The size of your data. Data has mass, and there is a cost to moving it around the network.
  • Velocity: The speed with which the data either arrives or is created, and how quickly it needs to be consumed in order to make use of it.
  • Variety: The differences in structure between all the types of data within an enterprise.

Scour the internet and you will find that there are other, less commonly discussed but relevant characteristics associated with Big Data to include Veracity and Volatility [2]. Veracity refers to the truthfulness of the data or your degree of trust in what it is conveying. Volatility is how often your existing data changes or is updated by the new data you are receiving/creating. There are certainly other characteristics as well. I recently began using the term Viscosity to describe the degree of data fragmentation in client environments, and the level of effort required to reassemble it into a coherent view. In this context, organizations with low viscosity have significant fragmentation, and duplication throughout their enterprise.

The term “Big Data” has come to focus on these characteristics and imply that traditional database architectures, such as on line transaction processing (OLTP) and on line analytic processing (OLAP) purpose-built technologies simply will not scale to meet your data capacity needs. However, massively parallel processing (MPP) database architectures are one example of purpose-built technologies that have been developed to support both OLTP and OLAP data structures at enormous scales up into the petabytes (PB). For example, there is a 50 PB MPP cluster at eBay [3]. Certainly this size conforms to any logical definition of Big Data.

Depending on your use case, it is possible that a purpose-built technology may suite your needs at scale. While MPP systems remain most effective with structured, tabular and transactional data sets, it is possible to store most everything except massive files in relational structures. However, this may not be the best fit for your use case(s) in terms of appropriateness or cost. There is limited published pricing data for commercial offerings, but MPP systems are notably expensive. When including the cost of software, hardware, and licensing/support, the cost per terabyte (TB) of an MPP system is estimated at tens of thousands of dollars [4]. At these prices, a one PB system can cost tens of millions of dollars. In contrast, the equivalent cost of Hadoop is roughly $2,000 per TB, leaving a one PB Hadoop cluster to cost roughly $2 million. That is a significant initial cost savings; however, use case(s) will always drive the total cost of any solution.

Coincidentally, eBay has also released information on their production 50 PB Hadoop cluster, one of the largest such clusters in the world [5]. The fact that eBay uses both types of systems demonstrates that there is a place for each, and that the difference may come down to price and purpose. Given the relative lower cost of Hadoop, I submit that it is easier to identify Big Data if we add cost to our definition. Therefore, Big Data is the result when (a) the sum of all your data’s characteristics coupled with (b) the resources required to achieve your use case exceeds (c) the cost you are willing/able to spend using traditional approaches. When that inflection point is reached, it is clearly time to consider other, non-traditional approaches for process offloading. Each unique situation warrants a cost/benefit analysis to determine if a general-purpose solution like Hadoop is right for your use case.

What are the Use Cases for Big Data?
Process offloading refers to the act of moving workloads from one implementation to another to achieve better suitability, performance, availability, etc., at a lower price point. Both traditional and non-traditional solutions have advantages and disadvantages given a particular workload, and they should be leveraged accordingly to maximize cost efficiencies. Let us examine Hadoop’s use cases for process offloading.

Hadoop is comprised of two major components: the Hadoop Distributed File System (HDFS) and MapReduce, a framework for writing applications to process large amounts of content over multiple nodes (servers). Hadoop is often referred to as a schemaless system because data is not forced into a schema upon ingest. Ultimately, there is a structure known as the key/value pair in which data is expressed as a collection of [key]->[value] tuples or records. This is the most fundamental data structure in computer science. Hadoop uses the key/value pair because nearly any data can be expressed, stored, processed and retrieved using this minimal structure. Because key/value is so rudimentary, a schema can be applied at query time based on the question being asked. This adds tremendous flexibility and differs significantly from traditional approaches like OLTP and OLAP, which require you to know/define the data model up front, and have an understanding of the questions you intend to ask. Figure 1 illustrates these different process flow models. Having to know what questions you intend to ask, and constructing a pre-defined schema will add artificial constraints to the answers you are able to get from the data.

Another issue with schema-based systems is scalability. Traditional relational architectures scale vertically with ease, but are difficult to design for horizontal scaling due to their rigid data structures (tables, table relationships, rows, columns, indices) which must be sharded or split across multiple nodes. The integrity of these structures must be maintained while offering near-real-time (on line) create, read, update and delete (CRUD) operations on data. This is not trivial, and it requires commercial companies to make significant financial investments to do it well, which drive up the cost of those solutions. As a schemaless system, the latest release of Hadoop (2.x) scales horizontally to 10,000+ nodes without the added complexity inherent to traditional MPP systems [6].

Many organizations have purpose-built solutions for asking business intelligence questions, providing disaster recovery/backup, etc., but scaling these solutions beyond an initial, narrowly defined usage for structured data usually involves significant cost increases. As a schemaless computational file system, Hadoop can be applied to an almost endless set of challenges at a lower cost. Below we walk through six higher-order use cases to illustrate how these savings can be realized:

1. Raw Storage/Data Lake: Backing up all the data your enterprise collects and creates daily, to include its historical holdings, for continuity of operations (COOP) and disaster recovery (DR) has previously been too expensive, and therefore unfeasible. Instead, businesses make difficult tradeoffs as to what will and will not be recoverable should disaster strike. Imagine the possibilities if you were able to economically store everything in your enterprise for the price of traditional commodity hard disks. Fortunately, Hadoop makes this dream a reality with its internally redundant data structure that by default makes three copies of all data written to HDFS. This scalable, schemaless raw storage lends itself conceptually to what is now being called a “data lake”. A data lake is based on the notion that data can be tagged with metadata about its source, contents, structure and other characteristics. These properties stay with the data as it is minimized into key-value pairs and written to the Hadoop file system. To process the data, all one needs to know is what data they wish to process leveraging these properties. This allows many different types of data to exist side-by-side within the simple structures of the Data Lake. The amount of pre-processing is minimal, as data is no longer fit into specific schemas up-front, making the data accessible to a wider variety of purposes. This would not be cost effective using traditional commercial systems.

2. Multi-Format Data Analysis: There are many different types of data beyond structured and unstructured text, to include audio, video, and images. Analyzing structured and unstructured text at scale can be an expensive and difficult challenge, but analyzing large collections of digital media is not even possible using traditional relational systems. Many businesses have previously been unable to unlock the potential of their data holdings due to an inability to process digital content, such as the ability to analyze and track objects in video, or to identify and extract biomarkers in healthcare images. HDFS accepts all these formats for analysis without the need for a schema. Hadoop’s ability to work with unstructured text and binary data (audio, video, imagery) extends well beyond the native capabilities offered by existing storage solutions, providing an enormous capability advantage.

3. Data Cleansing/Transformation Businesses often contend with multiple relational data models, unstructured text and streaming data. You likely need to correlate, cleanse, de-duplicate, synchronize and normalize/de-normalize these data sets as they move between databases and tools to create a complete, clean operating picture for downstream analysis. The vast majority of work in conducting analytics is often preparing the data for use. In addition, new initiatives to leverage autonomous self-reporting devices and sensors provide continuous streams of data, creating explosions in the amount of information if used in their raw form. Purpose-built technologies present challenges when attempting these types of tasks due to their reliance on schemas. General purpose solutions, like the Hadoop ecosystem, deliver an economical way of storing, pre-processing and/or summarizing these data sets and streams, thereby minimizing the unchecked growth in commercial licensing investments within your enterprise.

4. Data Exploration: When new questions arise, the relevant variables and their relationships must be identified from within your data before you can begin to calculate definitive answers. However, these elements are not always understood, nor are the best algorithms for analyzing the data. Exploration is often required in order to build a model that will answer the questions being asked. Traditional relational architectures with pre-defined schemas are not likely to provide a platform for discovery. In these cases, identifying key variables and useful analytic methods is a trial/error process. Hadoop provides a flexible, schemaless environment that reduces the friction associated with the iterative process of exploring and analyzing data when the model is unclear. Hadoop provides a sandbox for exploring data without having to increase commercial capacity or spend the time building new schemas.

5. Data Science & Personalization: Data science leverages tools and techniques from many different areas of study, to include statistics, machine learning, mathematics, probability/uncertainty modeling, etc., to surface meaning from data, and generate data-driven products. This is essentially the art of making data actionable, either by a user or a machine. Data science is not exclusive to Big Data, but there is tremendous knowledge potential in large data sets. One use of data science is for personalization, the act of exhaustively analyzing large quantities of related data, such as the online behaviors of millions of Internet users to in order to calculate recommendations for a specific individual. The results are then presented in the form of “you might also like” books, movies, and other targeted advertisements. These techniques are also being applied to healthcare where symptoms, genetics, treatments, and outcomes are being analyzed to optimize treatment for specific individuals to optimize treatments. Hadoop is a perfect platform for data collecting, synthesizing, munging, cleaning and joining disparate data sets for analysis to achieve decision-relevant insight.

6. Data Anonymization: Certain industries, perhaps healthcare more than any other, require anonymized data for research. Rules governing the release of such data to the public generally require the information contain no personally identifiable information (PII). In the case of healthcare specifically, the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, released in 2003, governs the use and disclosure of protected health information (PHI). The Privacy Rule does not give a specific algorithm for achieving the level of de-identification required, and there are many ways to approach anonymization in general, which depend greatly on how the data will be used. If a portion of your business model relies on providing anonymized data to internal or external groups for analysis, you want that process to be as clear, efficient, and repeatable as possible. Hadoop provides a solution for codifying and institutionalizing these algorithms for your enterprise. This increases the speed and effectiveness of all groups depending on anonymized data, providing them with an approved, documented process, and an authoritative source from which to receive data.

This list is not meant to be exhaustive, and there are definitely other use cases. Each use case is applicable to a wide variety of domains, to include finance, cyber, healthcare, defense, and scientific research.

What is the Business Value of Big Data?
Our new definition of Big Data (when the cost of using all your data for your use case exceeds what you are able to spend using purpose-built technologies) lends itself to a cost/benefit analysis. Figure 2 establishes a rubric through which to express the decision calculus of Big Data for process offloading. This framework illustrates the components of cost, discussed below, that every CIO and CTO should take into account when evaluating solutions for their use cases.

Projects should always start with gathering and analyzing requirements. In an analytic context, these are the questions you want to ask of your data. Or more generally, how you intend to use the data you would like to store in Hadoop. These requirements have obvious implications for leveraging the relevant data assets.

The Data / Characteristics (AS-IS) corner of the triangle refers to all data related to your requirements, and all the attributes discussed earlier, to include the amount, how quickly it grows/changes, differences in type/structure, where it resides on the network, etc.

Once the associated data has been identified, a solution is identified and designed. In the case of data analysis, the solution often involves models and techniques to change and analyze the data to find answers. Overall, this step includes any processes, human or machine, that are necessary to get the results you are looking for.

The Purpose / Answers (TO-BE) corner is your end-state vision, which is sometimes expressed in terms of success criteria and/or key performance indicators. In the case of data science, this corner represents the answers you want from your data, in addition to how users should expect to access those answers, and how frequently the answers need to be updated (real-time, hourly, daily, monthly, etc).

Lastly, there are often numerous ways for this solution to be physically implemented. Each possible implementation requires specific people, intellect (expertise, experience), technology (licenses, support), time, and physical capital (power, space, cooling) to assemble and extend (write algorithms, or build solutions on top of) the desired end-state. There are many factors here to consider. For example, certain software licenses will charge by the number of users, which may limit your derived business value (in terms of productivity) if that cost prevents your entire team from leveraging the software. As well, the more data you have, the more physical or virtual compute resources you may need.

Together, these elements influence the total cost of the solution. Ultimately, cost is the tipping point that can cause you to change the scope of your requirements and timeline, the data you use, the models/techniques you employ, the answers you are able to achieve and the algorithms/technologies you implement. Often, it is necessary to find an affordable balance to achieve the organization’s goals and objectives. However, these trade-offs may cause you to compromise certain business objectives, and reduce the business value derived from the solution.

The business value of Hadoop is the result of overcoming the functional limitations established by the cost of scaling purpose-built technologies, and having to make fewer compromises to achieve your data-driven business objectives. This relationship between cost and business value is illustrated in Figure 2. By managing (containing or reducing) cost, it becomes possible to maintain or broaden your scope and implement the solution that is right for you. Hadoop may allow you to get more from your data, with a significantly lower cost investment, resulting in tangible economic value. If Hadoop is able to satisfy your use case, then it is likely you will benefit from cost containment (and possibly savings) by preventing or reducing the expansion of more expensive purpose-built technologies.

Conclusion
It is important to choose the right technology for your particular use case. Hadoop continues to mature as a widely supported open source solution nearing its ten year anniversary. It is also supported by several commercial vendors offering on-site support. Depending on your particular use case(s), Hadoop may or may not be the best solution. Some Big Data is consistent, known, structured, and aligns well to the use cases best served by purpose-built technologies. However, when you do not have that, or cost constraints limit your business value, it is time to consider using a general purpose solution like the Hadoop ecosystem for process offloading. The formula presented in this paper provides a lens for CIOs/CTOs to examine potential solutions, business objectives, and cost constraints. Hadoop’s low cost and broad applicability are definitely worth exploring. I recommend you conduct your own cost/benefit analysis to determine if Hadoop is right for you and your use case(s). You may find that relative to commercial products, Hadoop will allow you to achieve greater business value and substantial cost savings.


[1] O’Reilly Media, Inc., “What is Big Data?”. Big Data Now: 2012 Edition. Sebastopol, CA. October 2012. Found Online at: http://www.oreilly.com/data/free/big-data-now-2012.csp

[2] Normandeau, Kevin. “Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity”. Inside BigData. September 2013. http://inside-bigdata.com/2013/09/12/beyond-volume-variety-velocity-issu...

[3] Harris, Derrick. “Teradata pluges 17% on Q3 warning: Is it economics or Hadoop?” Gigaom. October 2013. https://gigaom.com/2013/10/15/teradata-plunges-17-on-q3-warning-is-it-ec...

[4] Barth, Paul; Bean, Randy. “Get the Maximum Value Out of Your Big Data Initiative”. Harvard Business Review Blog Network. February 2013. http://blogs.hbr.org/2013/02/get-the-maximum-value-out-of-y/

[5] Ma, Ming. “Hadoop @ eBay Marketplaces”. Slideshare. June 2013. http://www.slideshare.net/Hadoop_Summit/ma-june27-140pmroom212v2

[6] Murthy, Arun. “Apache Hadoop YARN – Concepts and Applications”. Hortonworks. August 2012. http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/


About the author:

Jeremy Glesner is the Chief Technology Officer of Berico Technologies. Jeremy’s background is in information science and software engineering. Find him on Twitter at @jglesner (https://twitter.com/jglesner) and on Linkedin (http://www.linkedin.com/in/jeremyglesner/).

 

Read the original blog entry...

More Stories By Bob Gourley

Bob Gourley writes on enterprise IT. He is a founder and partner at Cognitio Corp and publsher of CTOvision.com

@ThingsExpo Stories
19th Cloud Expo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Meanwhile, 94% of enterpri...
Smart Cities are here to stay, but for their promise to be delivered, the data they produce must not be put in new siloes. In his session at @ThingsExpo, Mathias Herberts, Co-founder and CTO of Cityzen Data, will deep dive into best practices that will ensure a successful smart city journey.
SYS-CON Events announced today that 910Telecom will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Housed in the classic Denver Gas & Electric Building, 910 15th St., 910Telecom is a carrier-neutral telecom hotel located in the heart of Denver. Adjacent to CenturyLink, AT&T, and Denver Main, 910Telecom offers connectivity to all major carriers, Internet service providers, Internet backbones and ...
There is growing need for data-driven applications and the need for digital platforms to build these apps. In his session at 19th Cloud Expo, Muddu Sudhakar, VP and GM of Security & IoT at Splunk, will cover different PaaS solutions and Big Data platforms that are available to build applications. In addition, AI and machine learning are creating new requirements that developers need in the building of next-gen apps. The next-generation digital platforms have some of the past platform needs a...
SYS-CON Events announced today Telecom Reseller has been named “Media Sponsor” of SYS-CON's 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Telecom Reseller reports on Unified Communications, UCaaS, BPaaS for enterprise and SMBs. They report extensively on both customer premises based solutions such as IP-PBX as well as cloud based and hosted platforms.
I wanted to gather all of my Internet of Things (IOT) blogs into a single blog (that I could later use with my University of San Francisco (USF) Big Data “MBA” course). However as I started to pull these blogs together, I realized that my IOT discussion lacked a vision; it lacked an end point towards which an organization could drive their IOT envisioning, proof of value, app dev, data engineering and data science efforts. And I think that the IOT end point is really quite simple…
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 19th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devices - comp...
The 19th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Digital Transformation, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportuni...
DevOps at Cloud Expo, taking place Nov 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 19th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long dev...
Data is the fuel that drives the machine learning algorithmic engines and ultimately provides the business value. In his session at Cloud Expo, Ed Featherston, a director and senior enterprise architect at Collaborative Consulting, will discuss the key considerations around quality, volume, timeliness, and pedigree that must be dealt with in order to properly fuel that engine.
Pulzze Systems was happy to participate in such a premier event and thankful to be receiving the winning investment and global network support from G-Startup Worldwide. It is an exciting time for Pulzze to showcase the effectiveness of innovative technologies and enable them to make the world smarter and better. The reputable contest is held to identify promising startups around the globe that are assured to change the world through their innovative products and disruptive technologies. There w...
Today we can collect lots and lots of performance data. We build beautiful dashboards and even have fancy query languages to access and transform the data. Still performance data is a secret language only a couple of people understand. The more business becomes digital the more stakeholders are interested in this data including how it relates to business. Some of these people have never used a monitoring tool before. They have a question on their mind like “How is my application doing” but no id...
Personalization has long been the holy grail of marketing. Simply stated, communicate the most relevant offer to the right person and you will increase sales. To achieve this, you must understand the individual. Consequently, digital marketers developed many ways to gather and leverage customer information to deliver targeted experiences. In his session at @ThingsExpo, Lou Casal, Founder and Principal Consultant at Practicala, discussed how the Internet of Things (IoT) has accelerated our abil...
With so much going on in this space you could be forgiven for thinking you were always working with yesterday’s technologies. So much change, so quickly. What do you do if you have to build a solution from the ground up that is expected to live in the field for at least 5-10 years? This is the challenge we faced when we looked to refresh our existing 10-year-old custom hardware stack to measure the fullness of trash cans and compactors.
The emerging Internet of Everything creates tremendous new opportunities for customer engagement and business model innovation. However, enterprises must overcome a number of critical challenges to bring these new solutions to market. In his session at @ThingsExpo, Michael Martin, CTO/CIO at nfrastructure, outlined these key challenges and recommended approaches for overcoming them to achieve speed and agility in the design, development and implementation of Internet of Everything solutions wi...
Cloud computing is being adopted in one form or another by 94% of enterprises today. Tens of billions of new devices are being connected to The Internet of Things. And Big Data is driving this bus. An exponential increase is expected in the amount of information being processed, managed, analyzed, and acted upon by enterprise IT. This amazing is not part of some distant future - it is happening today. One report shows a 650% increase in enterprise data by 2020. Other estimates are even higher....
Identity is in everything and customers are looking to their providers to ensure the security of their identities, transactions and data. With the increased reliance on cloud-based services, service providers must build security and trust into their offerings, adding value to customers and improving the user experience. Making identity, security and privacy easy for customers provides a unique advantage over the competition.
Is the ongoing quest for agility in the data center forcing you to evaluate how to be a part of infrastructure automation efforts? As organizations evolve toward bimodal IT operations, they are embracing new service delivery models and leveraging virtualization to increase infrastructure agility. Therefore, the network must evolve in parallel to become equally agile. Read this essential piece of Gartner research for recommendations on achieving greater agility.
SYS-CON Events announced today that Venafi, the Immune System for the Internet™ and the leading provider of Next Generation Trust Protection, will exhibit at @DevOpsSummit at 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Venafi is the Immune System for the Internet™ that protects the foundation of all cybersecurity – cryptographic keys and digital certificates – so they can’t be misused by bad guys in attacks...
For basic one-to-one voice or video calling solutions, WebRTC has proven to be a very powerful technology. Although WebRTC’s core functionality is to provide secure, real-time p2p media streaming, leveraging native platform features and server-side components brings up new communication capabilities for web and native mobile applications, allowing for advanced multi-user use cases such as video broadcasting, conferencing, and media recording.