|By Jack Norris||
|August 19, 2012 08:15 AM EDT||
We have entered the "Age of Big Data" according to a recent New York Times article. This comes as no surprise to most organizations already struggling with the onslaught of data coming from an increasing number of sources and at an increasing rate. The 2011 IDC Digital Universe Study reported that data is growing faster than Moore's Law. This trend points to a paradigm shift in how organizations process data where isolated islands and silos are being replaced by large clusters of commodity servers that keep data and compute resources together.
Another way of looking at this paradigm shift is that the growing volume and velocity of data require a new approach to networked computing. A good example of this change is found at Google. The industry now takes Google's dominance for granted, but when Google launched its beta search engine in 1998, the company was late entering the market. At the time, Yahoo! was dominant; other contenders included infoseek, excite, Lycos, Ask Jeeves and AltaVista (dominating technical searches). Within two years, Google was the dominant search provider. It wasn't until 2003, when Google published a paper on MapReduce, that the world got a glimpse into Google's back-end architecture.
Google's architecture revealed how the company was able to index significantly more data, to get far better results faster, and to achieve these superior results much more efficiently and cost-effectively than all competitors. The shift Google made was to divide complex data analysis tasks into simple subtasks that could be performed in parallel on commodity servers. Separate processes were being used to Map the data, and then Reduce it into interim or final results. This MapReduce framework would eventually become available to organizations through distributions of Apache Hadoop.
A Brief History of Hadoop
After reading Google's paper in 2003, Yahoo engineer Doug Cutting developed a Java-based implementation of MapReduce, and named it after his son's stuffed elephant, Hadoop. In 2006, Hadoop became a subproject of Lucene (a popular text search library) at the Apache Software Foundation (www.apache.org), and became its own top-level Apache project in 2008.
Essentially, Hadoop provides a way to capture, organize, store, search, share, analyze and visualize disparate data sources (structured, semi-structured and unstructured) across a large cluster of commodity computers, and is designed to scale up from dozens to thousands of servers, each offering local computation and storage.
While there are several elements that are now part of Hadoop, two are fundamental to its operation. The first is the Hadoop Distributed File System (HDFS), which serves as the primary storage system. HDFS replicates and distributes the blocks of source data to the compute nodes throughout the cluster of servers to be analyzed by one or more applications. The second is MapReduce, which creates a software framework and a programming model for writing applications capable of processing vast amounts of distributed data in parallel on very large clusters.
The open source nature of Apache Hadoop creates an ecosystem that facilitates constant advancements in its capabilities, performance, reliability and ease of use. These enhancements can be made by any individual or organization-a global community of contributors-and are then either contributed to the basic Apache library or made available in a separate (often free) commercial distribution.
In effect, Hadoop is a complete system or "stack" for data analysis. The stack includes not only the HDFS and MapReduce foundation, but also job management, development tools, schedulers, machine learning libraries, etc.
KISS: Keep It Simple, Scalable
In a paper titled The Unreasonable Effectiveness of Data, the authors (all research directors from Google) make a contrast between the elegant simplicity of physics (with equations like E = mc2) and other disciplines, noting that, "... sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics."
The fact that simple formulas are fully capable of explaining the complex natural world, while remaining elusive in understanding human behavior, is fundamental to why Hadoop is gaining in popularity. The paper notes the frustration of economists, who lack similar simple equations or models, and explores advances being made in fields like natural language processing-a notoriously complex area that has been studied for years with many attempts at artificial intelligence as a means to gain some insight.
The authors found that relatively simple algorithms applied to massive datasets produced stunning results. One example involves scene completion. An algorithm was used to eliminate something in a picture, a car for instance, and then based on a corpus of thousands of pictures, fill in the missing background. The algorithm performed rather poorly until the corpus was increased to millions of photos. With sufficient data, the same, simple algorithm performed extremely well. This need to find patterns and fill in the "missing pieces" in any puzzle is a common theme in many data analytics applications today.
Data analytics also confronts another inherent complexity: the growth in unstructured and semi-structured data. The sources of unstructured data, such as log files, social media, videos, etc., are growing in both their size and importance. But even structured data that goes through a series of changes eventually loses some or all of its structure. Traditional analytic techniques require considerable preprocessing of unstructured and semi-structured data before being able to produce results, and the results can be wrong or misleading if the preprocessing is somehow flawed.
The ability of Hadoop to employ simple algorithms and obtain meaningful results when analyzing unstructured, semi-structured and structured data in its raw form is unprecedented-and currently unparalleled. MapReduce enables data to be analyzed in an incremental fashion (and with parallel processing) without any need to engage in complex data transformations or to otherwise preprocess any data sources, or to create any schemas or aggregate any data in advance. Sometimes the interim results can be quite revealing on their own, and any unexpected results can be used to further fine-tune additional analysis. In fact, Hadoop was designed to accommodate virtually all forms of data directly, thus eliminating the need to engage in extraordinary measures before being able to unlock the value hidden deeply within.
The Price/Performance of Data Analytics
Not only does Hadoop deliver superior data analytics capabilities and results, it does so (as Google found) with an infrastructure that is far more cost-effective than traditional data analysis tools. The reason is that scaling data analytics capabilities has long been subject to the 80/20 rule: Big gains can be achieved with little initial effort (and cost), but the returns diminish as the datasets grow to become Big Data.
In stark contrast, Hadoop can scale linearly, which is the key to both effective and cost-effective data analytics. As datasets grow, traditional data analysis environments scale in an exponential fashion, causing the additional cost required to gain additional insight to eventually become prohibitive. With Hadoop, by contrast, the cluster of commodity (read: inexpensive) servers with direct-attached storage scales linearly with the growth in the number and sizes of datasets.
Hadoop's ability to satisfy these prerequisites well is the reason for its growing popularity in Web-based businesses and data-intensive organizations, as well as at aggressive start-ups. For the former, the need to wrestle with truly Big Data justifies the need for a data analytics environment like Hadoop. For the latter, the lack of anything legacy makes it easy to benefit from Hadoop's advantages.
One major challenge to Hadoop adoption, however, remains its file system. HDFS is an append-only storage that requires data to be batch loaded in a Hadoop cluster and then later exported post-processing for use by other applications that don't support the HDFS API. And Big Data can be difficult and costly to move back and forth in this fashion owing to the inherent duplication of data across the "semantic wall" between the existing and Hadoop infrastructures.
Another barrier to production adoption of Hadoop in larger organizations involves the extraordinary measures required to make the environment dependable. Constant care is needed to ensure that single points of failure (especially in the NameNode and JobTracker) cannot cause catastrophe, and that in the case of data loss, data can be re-loaded into the Hadoop cluster.
Breaking Through the Barriers
These problems with Hadoop are, themselves, becoming part of the past. Open source communities can be quite large, creating a vibrant ecosystem. This is the case with Hadoop, where several companies are now providing commercial distributions based on open source Hadoop.
The growing number of commercial Hadoop distributions available is systematically breaking through the barriers to widespread adoption. In general, these distributions provide enhancements that make Hadoop easier to integrate into the enterprise, as well as more enterprise-class in its operation, performance and reliability. One way of achieving these enhancements is to use existing and standard communications protocols as a foundation to enable more seamless integration between legacy and Hadoop environments.
Such a common foundation facilitates making the paradigm shift in data analytics in virtually any organization. It eliminates the need to throw data back and forth over a "semantic wall" by tearing down that wall. The compatibility afforded also extends beyond the physical infrastructure and into development environments and routine operating procedures, especially those involving data protection, such as snapshots and mirroring. With standards-based file access into the Hadoop cluster, existing applications and tools, and even ordinary browsers are able to access the data directly and in real-time (vs. Hadoop's traditional batch processing.
The End - or Just the Beginning
The data analytics paradigm is changing, and the change presents a real opportunity for established organizations to take full advantage of some new and powerful capabilities without sacrificing any existing ones. Just as Google was able to do, Hadoop makes it possible for any organization to gain a significant competitive edge by taking full advantage of the insight provided by this paradigm shift.
Hadoop is indeed a game-changing technology, and Hadoop is now itself changing with the advent of enterprise-class commercial distributions. By making Hadoop more mission-critical in its operation (potentially with the same or an even lower total cost of ownership), these "next-generation" solutions make beginning the shift to the new data analytics paradigm less risky and more rewarding than ever before.
Cloud computing is being adopted in one form or another by 94% of enterprises today. Tens of billions of new devices are being connected to The Internet of Things. And Big Data is driving this bus. An exponential increase is expected in the amount of information being processed, managed, analyzed, and acted upon by enterprise IT. This amazing is not part of some distant future - it is happening today. One report shows a 650% increase in enterprise data by 2020. Other estimates are even higher....
Aug. 31, 2016 01:45 AM EDT Reads: 3,066
Smart Cities are here to stay, but for their promise to be delivered, the data they produce must not be put in new siloes. In his session at @ThingsExpo, Mathias Herberts, Co-founder and CTO of Cityzen Data, will deep dive into best practices that will ensure a successful smart city journey.
Aug. 31, 2016 01:00 AM EDT Reads: 1,701
Identity is in everything and customers are looking to their providers to ensure the security of their identities, transactions and data. With the increased reliance on cloud-based services, service providers must build security and trust into their offerings, adding value to customers and improving the user experience. Making identity, security and privacy easy for customers provides a unique advantage over the competition.
Aug. 30, 2016 08:15 PM EDT Reads: 2,489
SYS-CON Events announced today that 910Telecom will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Housed in the classic Denver Gas & Electric Building, 910 15th St., 910Telecom is a carrier-neutral telecom hotel located in the heart of Denver. Adjacent to CenturyLink, AT&T, and Denver Main, 910Telecom offers connectivity to all major carriers, Internet service providers, Internet backbones and ...
Aug. 30, 2016 08:00 PM EDT Reads: 2,031
SYS-CON Events announced today that Pulzze Systems will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Pulzze Systems, Inc. provides infrastructure products for the Internet of Things to enable any connected device and system to carry out matched operations without programming. For more information, visit http://www.pulzzesystems.com.
Aug. 30, 2016 07:15 PM EDT Reads: 356
There is growing need for data-driven applications and the need for digital platforms to build these apps. In his session at 19th Cloud Expo, Muddu Sudhakar, VP and GM of Security & IoT at Splunk, will cover different PaaS solutions and Big Data platforms that are available to build applications. In addition, AI and machine learning are creating new requirements that developers need in the building of next-gen apps. The next-generation digital platforms have some of the past platform needs a...
Aug. 30, 2016 07:00 PM EDT Reads: 936
Data is an unusual currency; it is not restricted by the same transactional limitations as money or people. In fact, the more that you leverage your data across multiple business use cases, the more valuable it becomes to the organization. And the same can be said about the organization’s analytics. In his session at 19th Cloud Expo, Bill Schmarzo, CTO for the Big Data Practice at EMC, will introduce a methodology for capturing, enriching and sharing data (and analytics) across the organizati...
Aug. 30, 2016 06:15 PM EDT Reads: 372
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 19th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devices - comp...
Aug. 30, 2016 03:30 PM EDT Reads: 3,782
SYS-CON Events announced today Telecom Reseller has been named “Media Sponsor” of SYS-CON's 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Telecom Reseller reports on Unified Communications, UCaaS, BPaaS for enterprise and SMBs. They report extensively on both customer premises based solutions such as IP-PBX as well as cloud based and hosted platforms.
Aug. 30, 2016 02:30 PM EDT Reads: 1,060
SYS-CON Events announced today that Adobe has been named “Bronze Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. Adobe is changing the world though digital experiences. Adobe helps customers develop and deliver high-impact experiences that differentiate brands, build loyalty, and drive revenue across every screen, including smartphones, computers, tablets and TVs. Adobe content solutions are used daily by millions of co...
Aug. 30, 2016 02:00 PM EDT Reads: 3,816
Why do your mobile transformations need to happen today? Mobile is the strategy that enterprise transformation centers on to drive customer engagement. In his general session at @ThingsExpo, Roger Woods, Director, Mobile Product & Strategy – Adobe Marketing Cloud, covered key IoT and mobile trends that are forcing mobile transformation, key components of a solid mobile strategy and explored how brands are effectively driving mobile change throughout the enterprise.
Aug. 30, 2016 01:45 PM EDT Reads: 703
Pulzze Systems was happy to participate in such a premier event and thankful to be receiving the winning investment and global network support from G-Startup Worldwide. It is an exciting time for Pulzze to showcase the effectiveness of innovative technologies and enable them to make the world smarter and better. The reputable contest is held to identify promising startups around the globe that are assured to change the world through their innovative products and disruptive technologies. There w...
Aug. 30, 2016 01:30 PM EDT Reads: 891
Data is the fuel that drives the machine learning algorithmic engines and ultimately provides the business value. In his session at Cloud Expo, Ed Featherston, a director and senior enterprise architect at Collaborative Consulting, will discuss the key considerations around quality, volume, timeliness, and pedigree that must be dealt with in order to properly fuel that engine.
Aug. 30, 2016 01:15 PM EDT Reads: 2,075
19th Cloud Expo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Meanwhile, 94% of enterpri...
Aug. 30, 2016 01:00 PM EDT Reads: 3,245
Almost two-thirds of companies either have or soon will have IoT as the backbone of their business in 2016. However, IoT is far more complex than most firms expected. How can you not get trapped in the pitfalls? In his session at @ThingsExpo, Tony Shan, a renowned visionary and thought leader, will introduce a holistic method of IoTification, which is the process of IoTifying the existing technology and business models to adopt and leverage IoT. He will drill down to the components in this fra...
Aug. 30, 2016 10:30 AM EDT Reads: 429
With so much going on in this space you could be forgiven for thinking you were always working with yesterday’s technologies. So much change, so quickly. What do you do if you have to build a solution from the ground up that is expected to live in the field for at least 5-10 years? This is the challenge we faced when we looked to refresh our existing 10-year-old custom hardware stack to measure the fullness of trash cans and compactors.
Aug. 30, 2016 02:30 AM EDT Reads: 1,878
The emerging Internet of Everything creates tremendous new opportunities for customer engagement and business model innovation. However, enterprises must overcome a number of critical challenges to bring these new solutions to market. In his session at @ThingsExpo, Michael Martin, CTO/CIO at nfrastructure, outlined these key challenges and recommended approaches for overcoming them to achieve speed and agility in the design, development and implementation of Internet of Everything solutions wi...
Aug. 30, 2016 02:00 AM EDT Reads: 2,273
Today we can collect lots and lots of performance data. We build beautiful dashboards and even have fancy query languages to access and transform the data. Still performance data is a secret language only a couple of people understand. The more business becomes digital the more stakeholders are interested in this data including how it relates to business. Some of these people have never used a monitoring tool before. They have a question on their mind like “How is my application doing” but no id...
Aug. 30, 2016 01:00 AM EDT Reads: 1,961
DevOps at Cloud Expo, taking place Nov 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 19th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long dev...
Aug. 29, 2016 10:00 PM EDT Reads: 2,519
The 19th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Digital Transformation, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportuni...
Aug. 28, 2016 10:30 PM EDT Reads: 4,103