By Jnan Dash | July 11, 2011 04:25 PM EDT | Reads: 3,499
The phrase “Big Data” is thrown around a lot these days. What exactly does it refer to? When I was part of IBM’s DB2 development team, the maximum size of a DB2 table was 64 gigabytes (GB), and I wondered who on earth could use a database that large. Thirty years later, that number looks small. Now you can buy a 1 terabyte external drive for less than $100.
Let us start with a level set on the units of storage. In multiples of 1000, we go from byte to kilobyte (KB), megabyte (MB), gigabyte (GB), terabyte (TB), petabyte (PB), exabyte (EB), zettabyte (ZB), and yottabyte (YB). The last one, the yottabyte, is 10 to the power of 24 bytes. A typed page is about 2KB. The entire book collection at the US Library of Congress is about 15TB. The amount of data processed in one hour at Google is about 1PB. The total amount of information in existence is around 1.27ZB. That gives some context to these numbers.
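To make the unit ladder concrete, here is a minimal Python sketch of the decimal conversions. The helper names `to_bytes` and `ratio` are illustrative only, and the sizes plugged in are the figures quoted above, taken as given.

```python
# Decimal storage units: each one is 1000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a value expressed in a decimal unit to bytes."""
    return value * 1000 ** UNITS.index(unit)


def ratio(value_a, unit_a, value_b, unit_b):
    """Return how many times quantity b fits into quantity a."""
    return to_bytes(value_a, unit_a) / to_bytes(value_b, unit_b)


# One yottabyte is 10 to the power of 24 bytes.
print(to_bytes(1, "YB") == 10 ** 24)

# Number of 2KB typed pages that fit in a 15TB collection.
print(ratio(15, "TB", 2, "KB"))
```

By this arithmetic, a 15TB collection corresponds to roughly 7.5 billion typed pages of 2KB each.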
When we say Big Data, we enter the petabyte space (1 petabyte is 1000 terabytes). There is talk of a “personal petabyte” to store all your audio, video, and pictures. The cost of a petabyte has come down from $2M in 2002 to $2K in 2012, a real Moore’s law at work in disk storage technology. This is not the stuff for current commercial database products such as DB2, Oracle, or SQL Server. Such RDBMSs handle a maximum of 10 to 100 terabytes. Anything bigger would cause serious performance nightmares. These large databases are mostly found in decision support and data warehousing applications. Walmart is known to have its main retail transaction data warehouse at 100-plus terabytes in a Teradata DBMS.
Most of the growth in data is in “files”, not in DBMSs. Now we see huge volumes of data in social networking sites like Facebook. At the beginning of 2010, Facebook was handling more than 4TB per day (compressed). Now that it has grown to 750M users, that number is at least 50% higher. The new Zuck’s (Zuckerberg’s) law is “Shared contents double every 24 months”. The question is how to deal with such volumes.
Google pioneered the programming model called MapReduce to process massive amounts of data in parallel across hundreds of thousands of commodity servers. A simple Google query you type probably touches 700 to 1000 servers to yield that half-second response time. An open source implementation of MapReduce was developed under the Apache umbrella and released as Hadoop (created by Doug Cutting, formerly of Xerox PARC and Apple, now at Cloudera). Besides the MapReduce computational process, Hadoop has a file store called HDFS. Hadoop therefore is a “flexible and available architecture for large scale computation and data processing on a network of commodity servers”. Cloudera (a new VC-funded company) is to Hadoop what Red Hat is to Linux.
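The MapReduce idea can be illustrated with a toy word count. What follows is a minimal single-process Python sketch of the map, shuffle, and reduce steps, not Google's or Hadoop's actual implementation; the function names are made up for illustration, and a real cluster would run the map and reduce steps in parallel on many servers.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data is big", "data about data"]
counts = word_count(docs)
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both steps can be spread over many machines, which is what makes the model a good fit for commodity clusters.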
While Hadoop is becoming a de facto standard for big data, its pedigree is batch processing. For near-real-time analytics, better answers are needed. Yahoo, for example, has a real-time analytics project called S4. Several other innovations are happening in this area of real-time or near-real-time analytics. Visualization is another hot area for big data.
Big Data offers many opportunities for innovation in the next few years.
Copyright © 2011 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Jnan Dash
Jnan Dash is Senior Advisor at EZShield Inc., Advisor at ScaleDB and Board Member at Compassites Software Solutions. He has lived in Silicon Valley since 1979. Formerly he was the Chief Strategy Officer (Consulting) at Curl Inc., before which he spent ten years at Oracle Corporation and was the Group Vice President, Systems Architecture and Technology till 2002. He was responsible for setting Oracle's core database and application server product directions and interacted with customers worldwide in translating future needs to product plans. Before that he spent 16 years at IBM. He blogs at http://jnandash.ulitzer.com.