Welcome!

Apache Authors: Liz McMillan, Pat Romanski, Elizabeth White, Christopher Harrold, Janakiram MSV

Related Topics: @CloudExpo

@CloudExpo: Blog Post

Introducing Big Data

When we say Big Data, we enter the petabyte space (1000 terabytes)

The phrase “Big Data” is thrown around a lot these days. What exactly is referred to by this phrase? When I was part of IBM’s DB2 development team, the largest size limit of a DB2 Table was 64 Gigabytes (GB) and I thought who on earth can use this size of a database. Thirty years later, that number looks so small. Now you can buy a 1 Terabyte external drive for less than $100.

Let us start with a level set on the unit of storage. In multiples of 1000, we go from Byte – Kilobyte (KB) – Megabyte (MB) – Gigabyte (GB) – Terabyte (TB) – Petabyte (PB) – Exabyte (EB) – Zettabyte (ZB) – Yottabyte (YB). The last one YB is 10 to the power of 24. A typed page is 2KB. The entire book collection at the US Library of Congress is 15TB. The amount of data processed in one hour at Google is 1PB. The total amount of information in existence is around 1.27ZB. Now you get some context to these numbers.

When we say Big Data, we enter the petabyte space (1000 Terabytes). There is talk of “personal petabyte” to store all your audio, video, and pictures. The cost has come down from $2M in 2002 to $2K in 2012 – real Moore’s law in disk storage technology here. This is not the stuff for current commercial database products such as DB2 or Oracle or SQLServer. Such RDBMS’s handle maximum of 10 to 100 Terabyte sizes. Anything bigger would cause serious performance nightmares. These large databases are mostly in the decision support and data warehousing applications. Walmart is known to have its main retail transaction data warehouse at 100 plus terabytes in a Teradata DBMS system.

Most of the growth in data is in “files”, not in DBMS. Now we see huge volumes of data in social networking sites like Facebook. At the beginning of 2010, Facebook was handling more than 4TB per day (compressed). Now that it has gone to 750M users, that number is at least 50% more. The new Zuck’s (Zuckerberg) law is , “Shared contents double every 24 months”. The question is how to deal with such volumes.

Google pioneered the algorithm called MapReduce to process massive amounts of data via parallel processing through hundreds of thousands of commodity servers. A simple Google query you type, probably touches 700 to 1000 servers to yield that half-second response time. MapReduce was made an open source under the Apache umbrella and was released as Hadoop (by Doug Cutting, former Xerox Parc, Apple, now at Cloudera). Hadoop has a file store called HDFS besides the MapReduce computational process. Hadoop therefore is a “flexible and available architecture for large scale computation and data processing on a network of commodity servers”. What is Redhat to Linux is Cloudera (new VC funded company) to Hadoop.

While Hadoop is becoming a defacto standard for big data, it’s pedigree is batch. For near-real-time analytics, better answers are needed. Yahoo, for example, has a real time analytics project called S4. Several other innovations are happening in this area of realtime or near realtime analytics. Visualization is another hot area for big data.

Big Data offers many opportunities for innovation in next few years.

Read the original blog entry...

More Stories By Jnan Dash

Jnan Dash is Senior Advisor at EZShield Inc., Advisor at ScaleDB and Board Member at Compassites Software Solutions. He has lived in Silicon Valley since 1979. Formerly he was the Chief Strategy Officer (Consulting) at Curl Inc., before which he spent ten years at Oracle Corporation and was the Group Vice President, Systems Architecture and Technology till 2002. He was responsible for setting Oracle's core database and application server product directions and interacted with customers worldwide in translating future needs to product plans. Before that he spent 16 years at IBM. He blogs at http://jnandash.ulitzer.com.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@ThingsExpo Stories
Apache Hadoop is emerging as a distributed platform for handling large and fast incoming streams of data. Predictive maintenance, supply chain optimization, and Internet-of-Things analysis are examples where Hadoop provides the scalable storage, processing, and analytics platform to gain meaningful insights from granular data that is typically only valuable from a large-scale, aggregate view. One architecture useful for capturing and analyzing streaming data is the Lambda Architecture, represent...
My team embarked on building a data lake for our sales and marketing data to better understand customer journeys. This required building a hybrid data pipeline to connect our cloud CRM with the new Hadoop Data Lake. One challenge is that IT was not in a position to provide support until we proved value and marketing did not have the experience, so we embarked on the journey ourselves within the product marketing team for our line of business within Progress. In his session at @BigDataExpo, Sum...
Things are changing so quickly in IoT that it would take a wizard to predict which ecosystem will gain the most traction. In order for IoT to reach its potential, smart devices must be able to work together. Today, there are a slew of interoperability standards being promoted by big names to make this happen: HomeKit, Brillo and Alljoyn. In his session at @ThingsExpo, Adam Justice, vice president and general manager of Grid Connect, will review what happens when smart devices don’t work togethe...
SYS-CON Events announced today that Ocean9will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Ocean9 provides cloud services for Backup, Disaster Recovery (DRaaS) and instant Innovation, and redefines enterprise infrastructure with its cloud native subscription offerings for mission critical SAP workloads.
In his session at @ThingsExpo, Eric Lachapelle, CEO of the Professional Evaluation and Certification Board (PECB), will provide an overview of various initiatives to certifiy the security of connected devices and future trends in ensuring public trust of IoT. Eric Lachapelle is the Chief Executive Officer of the Professional Evaluation and Certification Board (PECB), an international certification body. His role is to help companies and individuals to achieve professional, accredited and worldw...
SYS-CON Events announced today that Technologic Systems Inc., an embedded systems solutions company, will exhibit at SYS-CON's @ThingsExpo, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Technologic Systems is an embedded systems company with headquarters in Fountain Hills, Arizona. They have been in business for 32 years, helping more than 8,000 OEM customers and building over a hundred COTS products that have never been discontinued. Technologic Systems’ pr...
SYS-CON Events announced today that CA Technologies has been named “Platinum Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business – from apparel to energy – is being rewritten by software. From ...
The taxi industry never saw Uber coming. Startups are a threat to incumbents like never before, and a major enabler for startups is that they are instantly “cloud ready.” If innovation moves at the pace of IT, then your company is in trouble. Why? Because your data center will not keep up with frenetic pace AWS, Microsoft and Google are rolling out new capabilities In his session at 20th Cloud Expo, Don Browning, VP of Cloud Architecture at Turner, will posit that disruption is inevitable for c...
SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a ...
Keeping pace with advancements in software delivery processes and tooling is taxing even for the most proficient organizations. Point tools, platforms, open source and the increasing adoption of private and public cloud services requires strong engineering rigor - all in the face of developer demands to use the tools of choice. As Agile has settled in as a mainstream practice, now DevOps has emerged as the next wave to improve software delivery speed and output. To make DevOps work, organization...
SYS-CON Events announced today that Loom Systems will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded in 2015, Loom Systems delivers an advanced AI solution to predict and prevent problems in the digital business. Loom stands alone in the industry as an AI analysis platform requiring no prior math knowledge from operators, leveraging the existing staff to succeed in the digital era. With offices in S...
The explosion of new web/cloud/IoT-based applications and the data they generate are transforming our world right before our eyes. In this rush to adopt these new technologies, organizations are often ignoring fundamental questions concerning who owns the data and failing to ask for permission to conduct invasive surveillance of their customers. Organizations that are not transparent about how their systems gather data telemetry without offering shared data ownership risk product rejection, regu...
SYS-CON Events announced today that Interoute, owner-operator of one of Europe's largest networks and a global cloud services platform, has been named “Bronze Sponsor” of SYS-CON's 20th Cloud Expo, which will take place on June 6-8, 2017 at the Javits Center in New York, New York. Interoute is the owner-operator of one of Europe's largest networks and a global cloud services platform which encompasses 12 data centers, 14 virtual data centers and 31 colocation centers, with connections to 195 add...
SYS-CON Events announced today that SoftLayer, an IBM Company, has been named “Gold Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. SoftLayer, an IBM Company, provides cloud infrastructure as a service from a growing number of data centers and network points of presence around the world. SoftLayer’s customers range from Web startups to global enterprises.
SYS-CON Events announced today that CrowdReviews.com has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. CrowdReviews.com is a transparent online platform for determining which products and services are the best based on the opinion of the crowd. The crowd consists of Internet users that have experienced products and services first-hand and have an interest in letting other potential buyers...
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on ...
SYS-CON Events announced today that Infranics will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Since 2000, Infranics has developed SysMaster Suite, which is required for the stable and efficient management of ICT infrastructure. The ICT management solution developed and provided by Infranics continues to add intelligence to the ICT infrastructure through the IMC (Infra Management Cycle) based on mathemat...
SYS-CON Events announced today that SD Times | BZ Media has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. BZ Media LLC is a high-tech media company that produces technical conferences and expositions, and publishes a magazine, newsletters and websites in the software development, SharePoint, mobile development and commercial UAV markets.
SYS-CON Events announced today that Telecom Reseller has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. Telecom Reseller reports on Unified Communications, UCaaS, BPaaS for enterprise and SMBs. They report extensively on both customer premises based solutions such as IP-PBX as well as cloud based and hosted platforms.
"I think that everyone recognizes that for IoT to really realize its full potential and value that it is about creating ecosystems and marketplaces and that no single vendor is able to support what is required," explained Esmeralda Swartz, VP, Marketing Enterprise and Cloud at Ericsson, in this SYS-CON.tv interview at @ThingsExpo, held June 7-9, 2016, at the Javits Center in New York City, NY.