


The 'Big' Fallacy of Big Data | @BigDataExpo #BigData

Why companies are luring you into the Big Data Trap

Unless you've been living under a rock for the past couple of years, you've been hearing about Big Data nonstop. Big Data promises fortune and power to those who can wield its somewhat mystical and often nebulous capabilities. Unfortunately for the rest of us mere mortals, Big Data is built on an outright lie that is both pernicious and unfortunate. It's hiding right there in plain sight in the name itself: the word BIG.

The Fallacy of Big Data is that you have to have a lot of data for it to be relevant. The common catchphrase is: "More data = more insights". There is a nugget of truth to this: in some cases, a lot of data is needed to establish valid patterns and create real insight into the activity the data represents. More often than not, however, it creates a significant challenge for those responsible for performing analytics: sifting through a mountain of data to find the parts that actually matter. Recent studies have shown that fully 80% of the time in data analysis is spent just tinkering with the data to get it into a usable format. So more data creates a massive data curation issue and leaves us with more work to do before we can even start experimenting, much less monetizing our data.

The reality of "Big Data" is that it was invented by those with no skin in the game. Analytics, open source, digital transformation, and Cloud are the technologies that enable comprehensive data analysis. They need only minimal infrastructure, commodity hardware, and free or nearly free software to store, analyze, and, more importantly, drive value from that data, which leaves the big infrastructure players out in the cold with nothing to offer. Enter "Big Data": if you are going to try to manage petabytes of data you need good storage, and tens of thousands of servers are awful to manage. So the Fallacy is born:

"In order to get real results from data, you cannot rely on just a little bit of it, or just the relevant data, you need every set of data imaginable. Therefore, (and here's where things get squidgy) you need to bring all that data in house (because the cloud is too expensive to store it) and you need a lot of manageable and flexible enterprise-grade gear to do it with (because free stuff is not enterprise ready)."

You can see how this is built around some nuggets of truth. I was asked recently, "How would you move a petabyte of data to Amazon cloud storage?" and I answered as truthfully as I could: "Very slowly". Cloud does get expensive when used for a lot of infrastructure, but when used as part of the overall solution it is an important tool. Also, the thought of managing a massive Hadoop cluster of 1000 "exactly the same" servers sounds like the hell of IT in the pre-VM days, but it is not really an accurate picture of the Hadoop landscape. The vast majority of analytics clusters top out around 50 servers, and that's far more manageable (and less expensive) than huge enterprise gear. To be fair, there are organizations out there where a massive-scale, enterprise-platform approach will make sense, but the unfortunate side effect of this approach by legacy vendors is that they have made the solution itself the barrier to entry.
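To put a rough number on "very slowly", here is a back-of-the-envelope sketch. The link speeds, the decimal definition of a petabyte (10^15 bytes), and the assumption of a perfectly saturated link with no protocol overhead are all illustrative choices of mine, not figures from any provider; real transfers would likely take longer.

```python
# Rough estimate of how long a bulk upload takes over a network link.
# Assumptions (illustrative only): 1 PB = 10**15 bytes, and the link is
# fully saturated for the whole transfer unless `efficiency` says otherwise.

SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # bytes, decimal definition


def transfer_days(size_bytes: float, link_bits_per_sec: float,
                  efficiency: float = 1.0) -> float:
    """Return the transfer time in days for `size_bytes` over the given link."""
    seconds = size_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / SECONDS_PER_DAY


# Hypothetical link speeds, chosen only to show the order of magnitude.
for label, bps in [("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
    print(f"{label}: {transfer_days(PETABYTE, bps):,.1f} days")
```

Under these assumptions a petabyte takes roughly three months over a dedicated 1 Gbps link, which is why moving data of that size into the cloud over the wire is rarely the first choice.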

The problem is that "Big Data" has now made it into the vernacular and, worse yet, has become synonymous with Data Analytics. Every company, organization, or even individual on earth can benefit from analyzing their relevant data for new insights. Take a very simple example: look at your budget to identify where you overspend (too many meals out, for example). That is personal analytics. It does not require complex anything, and there are numerous ways to do it with free or nearly free tools.

Now scale that up to the bank that wants to offer new digital, data-driven products to customers. They already have a lot of that data in house, and they already have a lot of analytical tools. Why would they need, per se, to include every data set under the sun? They may want some more sets of data (social media to identify trends that might lead to investment opportunity), but they don't HAVE to have it stored in house to use it - it is all offered free-to-use via serialized APIs. Even in the unusual case where they did decide to store it all in house, we are not talking about tens of PB of data. It is more like adding a few tens to hundreds of TB for the data in question, because again - you don't download all of Twitter, just the stuff that is relevant to you. Also, analytic data is largely transient data, meaning that it is used for the analysis and then discarded (especially true in the real-time world), so where is the need for massive infrastructure to support that initiative?
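The household-budget example mentioned earlier is small enough to sketch in a few lines with nothing but the Python standard library. The category names, amounts, and limits below are made-up sample data, and `overspent_categories` is a hypothetical helper written for this illustration, not part of any tool.

```python
from collections import defaultdict


def overspent_categories(transactions, budget):
    """Return {category: amount over budget} for categories that exceed their limit.

    `transactions` is an iterable of (category, amount) pairs;
    `budget` maps each category to its spending limit.
    """
    totals = defaultdict(float)
    for category, amount in transactions:
        totals[category] += amount
    return {
        category: round(totals.get(category, 0.0) - limit, 2)
        for category, limit in budget.items()
        if totals.get(category, 0.0) > limit
    }


# Made-up sample data for one month.
transactions = [
    ("meals out", 42.50),
    ("groceries", 180.00),
    ("meals out", 63.25),
    ("meals out", 55.00),
    ("fuel", 60.00),
]
budget = {"meals out": 100, "groceries": 250, "fuel": 80}

print(overspent_categories(transactions, budget))
```

With this sample data, only the "meals out" category comes back as over its limit. The point is the scale: a useful answer from a handful of rows, no cluster required.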

I have spoken a lot about "Big Data" and the Fallacy and trap of paying too much attention to the word BIG. Data is important to everyone and it can have value for anyone. In my most recent speaking sessions I have shown how you can do a simple social analysis for free in a matter of minutes. You don't need a massive infrastructure to make that production ready either. It just takes some willingness to see through the noise to the actual value of what the "Big Data" message is trying to say. Analytics is important and valuable for everyone. You don't have to be a Fortune 100 company to create value from the data you already have, and to bring in new data for analytics. Everyone can do it.


Connect with me on Twitter or LinkedIn and share your thoughts!

More Stories By Christopher Harrold

As an Agent of IT Transformation, I have over 20 years of experience in the field. I started off as the IT Ops guy and followed the trends of the DevOps movement wherever I went. I want to shake up accepted ways of thinking and develop new models and designs that push the boundaries of technology and of the accepted status quo. There is no greater reward for me than seeing something that was once dismissed as "impossible" become the new normal, and I have been richly rewarded throughout my career with this result. In my last role as CTO at EMC Corporation, I worked tirelessly with a small group of engineers and product managers to build a market-leading, innovative platform for data analytics, combining best-of-breed storage, analytics, and visualization solutions that enable the Data as a Service model for enterprise and mid-sized companies globally.
