@DXWorldExpo: Article

Five Big Data Features in DB2 Databases

Traditional RDBMS and Big Data

Traditional RDBMS & New Data Processing
Over the past two decades, relational databases have been highly successful in serving large-scale OLTP and OLAP applications across enterprises. In the past couple of years, however, the rise of Big Data processing, and in particular the need to process unstructured data in massive quantities, has pushed the industry to look at non-RDBMS solutions. This has led to the popularity of NoSQL databases as well as massively parallel processing frameworks.

However, traditional RDBMS vendors have been quick to react and have added several Big Data features to their offerings, so that enterprises with heavy investments in a traditional RDBMS can have the best of both worlds by properly leveraging these new features.

The following sections give an overview of the Big Data features in the popular DB2 database. A similar analysis of Oracle will follow in a later article. Please also refer to my earlier article, Five Big Data Features in SQL Server.

1. DB2 Text Search
DB2 Text Search provides extensive capabilities for searching data in text columns stored in a DB2 table. The search system provides fast query response times and a consolidated, ranked result set that enables you to quickly and easily locate the information that you need. By incorporating the functions of DB2 Text Search in your SQL and XQuery statements, you can create powerful and versatile text-retrieval programs.

DB2 Text Search works by collecting data from diverse sources and indexing it for subsequent fast retrieval. DB2 Text Search uses linguistic analysis to improve search results and supports the following document formats:

  • Unstructured plain text.
  • Structured text such as that in HTML or XML documents.
  • Proprietary document formats such as PDF or Microsoft Office document formats.

We can perform various kinds of searches, such as:

  • Basic Search: Uses Boolean operators and modifiers.
  • Fuzzy Search: Matches words whose spelling is similar to the search term.
  • Proximity Search: Retrieves documents that contain search words located within a specified distance of each other.
  • SCORE Search: Uses the SCORE function to find out the extent to which a document matches the search argument.
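To make the fuzzy and proximity search types more concrete, here is a minimal Python sketch of the two ideas. It is only an illustration and says nothing about how DB2 Text Search implements them. The helper names (tokenize, fuzzy_contains, within_distance), the sample documents, and the similarity cutoff are all assumptions made for this example.

```python
import difflib
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def fuzzy_contains(text, term, cutoff=0.8):
    """True if some word in the text is spelled similarly to `term`."""
    matches = difflib.get_close_matches(term.lower(), tokenize(text), n=1, cutoff=cutoff)
    return bool(matches)

def within_distance(text, first, second, max_gap):
    """True if `first` and `second` occur at most `max_gap` token positions apart."""
    tokens = tokenize(text)
    first_pos = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_pos = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(i - j) <= max_gap for i in first_pos for j in second_pos)

# Hypothetical sample documents keyed by document id.
docs = {
    1: "The database partition stores the text index",
    2: "A database is used by many applications that also need a large index",
}

# Proximity search: "database" and "index" within 6 words of each other.
hits = [doc_id for doc_id, body in docs.items()
        if within_distance(body, "database", "index", 6)]
```

Both documents contain both words, but only document 1 has them close enough together, so the proximity search narrows the result where a plain Boolean AND would not. The fuzzy helper would still find document 1 for a misspelled term such as "databse".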

DB2® Text Search provides dictionary packs to support the linguistic processing of documents and queries. As an alternative to dictionary-based word segmentation, the search engine provides an option to select n-gram segmentation for languages such as Chinese, Japanese, and Korean. As Big Data use cases and patterns make evident, insights such as sentiment analysis are hard to obtain without natural language processing features like these.
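The idea behind n-gram segmentation can be sketched in a few lines of Python. This is a simplified illustration that assumes character bigrams and a naive "all n-grams present" match. The function names are hypothetical, and this is not the segmentation logic of the DB2 search engine.

```python
def ngrams(text, n=2):
    """Split text into overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def ngram_match(document, query, n=2):
    """True if every n-gram of the query also appears in the document."""
    doc_grams = set(ngrams(document, n))
    return all(gram in doc_grams for gram in ngrams(query, n))

# No dictionary is needed to segment text written without spaces.
grams = ngrams("東京都庁")
found = ngram_match("東京都庁", "都庁")
```

Because no dictionary is involved, the segmentation works on any text, but it can also match across word boundaries. With the functions above, the query "京都" matches inside "東京都庁" even though the document is not about Kyoto. That is the kind of trade-off against dictionary-based segmentation to keep in mind.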

2. Partitioned Databases
MPP (Massively Parallel Processing) frameworks like Hadoop are well suited to processing large quantities of data because of their Shared Nothing Architecture and their ability to process data in parallel. DB2 on Unix/Windows pioneered this concept with its partitioned database option.

A partitioned database environment is a database installation that supports the distribution of data across database partitions. Because data is distributed across database partitions, you can use the power of multiple processors on multiple physical machines to satisfy requests for information. Data retrieval and update requests are decomposed automatically into sub-requests, and executed in parallel among the applicable database partitions. The fact that databases are split across database partitions is transparent to users issuing SQL statements. Interpartition parallelism refers to the ability to break up a query into multiple parts across multiple partitions of a partitioned database, on one machine or multiple machines. The query is run in parallel. Some DB2 utilities also perform this type of parallelism.
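The flow described above, where rows are distributed by a key, a request is split into sub-requests, and partial results are merged, can be sketched roughly as follows. This is a toy model under stated assumptions: the partition count, the CRC32 hash, and the thread pool are choices made for this illustration and do not reflect DB2 internals.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_PARTITIONS = 4  # assumed partition count for this sketch

def partition_for(key):
    """Map a distribution key to a partition number using a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_PARTITIONS

def load(rows):
    """Distribute (key, amount) rows across partitions by hashed key."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for key, amount in rows:
        partitions[partition_for(key)].append((key, amount))
    return partitions

def partial_sum(partition, min_amount):
    """Sub-request: runs against the rows of one partition only."""
    return sum(amount for _, amount in partition if amount >= min_amount)

def parallel_sum(partitions, min_amount):
    """Coordinator: fan out one sub-request per partition, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        partials = pool.map(lambda p: partial_sum(p, min_amount), partitions)
        return sum(partials)

# Hypothetical order rows: (order_id, amount).
rows = [(order_id, order_id * 10) for order_id in range(1, 101)]
partitions = load(rows)
total = parallel_sum(partitions, 500)
```

The caller of parallel_sum never needs to know which partition holds which row, which mirrors the transparency to SQL users described above.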

In support of unstructured Big Data processing, the DB2® Text Search feature described earlier is integrated with the partitioned database environment and supports full-text search there. Text search indexes are distributed in a pattern that matches the base tables on which they are created. For each database partition, a text index partition, also called a collection, is created. This pattern facilitates text search maintenance by allowing text search index updates to execute in parallel on all index partitions.

3. pureXML
The pureXML® feature allows you to store well-formed XML documents in database table columns that have the XML data type. By storing XML data in XML columns, the data is kept in its native hierarchical form, rather than stored as text or mapped to a different data model.

There is no architectural limit on the size of an XML value in a database. An index over XML data can be used to improve the efficiency of queries on XML documents that are stored in an XML column. In contrast to traditional relational indexes, where index keys are composed of one or more table columns you specify, an index over XML data uses a particular XML pattern expression to index paths and values in XML documents stored within a single column. The data type of that column must be XML.
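As a rough illustration of how an index over XML data differs from a column index, the following Python sketch builds a lookup from the values found at one path pattern to the documents that contain them. The sample documents, the path, and the function name are assumptions for this example, and the standard library's ElementTree stands in for the database's XML storage.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical XML documents stored in one "XML column", keyed by row id.
documents = {
    1: "<order><customer><city>Austin</city></customer></order>",
    2: "<order><customer><city>Boston</city></customer></order>",
    3: "<order><customer><city>Austin</city></customer></order>",
}

def build_path_index(docs, pattern):
    """Index the text values found at `pattern`, mapping each value to the row ids that contain it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for node in root.findall(pattern):
            index[node.text].add(doc_id)
    return index

# The index key is a path inside the document, not a table column.
city_index = build_path_index(documents, "./customer/city")
austin_orders = sorted(city_index["Austin"])
```

A query that filters on the customer city can then consult the index instead of parsing every stored document.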

Tables containing XML columns can be stored in multi-partition databases: in the latest version of DB2, the pureXML feature is supported in partitioned database environments. With both features tightly integrated, pureXML customers can distribute XML data across multiple database partitions and parallelize XML queries for better performance, while customers of partitioned database environments can deploy pureXML for new business applications.

This combination of large XML documents and parallel processing makes a strong case for using DB2 for Big Data processing.

4. DB2 Federation

One of the important needs of Big Data processing is to connect to multiple disparate data sources and get the best out of each of them. Enterprises can no longer afford to have a single common data store for all their data processing needs.

In DB2, a federated system is a type of distributed database management system that you can use to access data sources across your enterprise. According to the IBM documentation, DB2 federation supports almost all kinds of structured and unstructured data sources. In particular, there is support for flat files, Microsoft Excel, and VSAM files.
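The general idea of federation, one query surface over differently shaped sources, can be sketched in Python as below. Everything here is a stand-in chosen for illustration: an in-memory SQLite database plays the remote relational source, an in-memory CSV string plays the flat file, and the CATALOG dictionary is a hypothetical registry, loosely analogous to the local names a federated server uses for remote objects. It does not use or imitate any DB2 API.

```python
import csv
import io
import sqlite3

# Source 1: a relational table. An in-memory SQLite database stands in for a remote RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2: a flat file. An in-memory CSV string stands in for a file on disk.
ORDERS_CSV = "customer_id,amount\n1,250\n2,100\n1,75\n"

def read_customers():
    """Adapter for the relational source: returns rows as dictionaries."""
    cursor = conn.execute("SELECT id, name FROM customers ORDER BY id")
    return [{"id": row[0], "name": row[1]} for row in cursor]

def read_orders():
    """Adapter for the flat-file source: parses CSV text into typed rows."""
    reader = csv.DictReader(io.StringIO(ORDERS_CSV))
    return [{"customer_id": int(r["customer_id"]), "amount": int(r["amount"])} for r in reader]

# Hypothetical catalog: a local name for each source, mapped to the adapter that reads it.
CATALOG = {"customers": read_customers, "orders": read_orders}

def total_by_customer():
    """Join the two sources by customer id as if both were local tables."""
    customers = CATALOG["customers"]()
    names = {c["id"]: c["name"] for c in customers}
    totals = {c["name"]: 0 for c in customers}
    for order in CATALOG["orders"]():
        totals[names[order["customer_id"]]] += order["amount"]
    return totals

totals = total_by_customer()
```

The join logic only talks to the catalog, so swapping the CSV for another kind of source would mean writing a new adapter, not changing the query.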

One interesting component of DB2 federation is its support for connecting to Netezza. Netezza is a high-performance data warehouse appliance, and IBM® Netezza® Analytics is an embedded, purpose-built, advanced analytics platform.

5. pureScale
While the Shared Nothing Architecture has been a standard for many massively parallel processing environments, there are successful architectures that use the Shared Disk model too. Major examples are IBM's mainframe Parallel Sysplex and Oracle Real Application Clusters.

With the DB2 pureScale Feature, scaling your database solution is simple. Multiple database servers, known as members, process incoming database requests; these members operate in a clustered system and share data. You can transparently add more members to scale out to meet even the most demanding business needs. There are no application changes to make, data to redistribute, or performance tuning to do. The IBM® DB2® pureScale® Feature, much like a multi-partition database environment, provides a scalable and highly available database solution. However, the instance type and data layout of a DB2 pureScale environment and a multi-partition database environment are different.

A DB2 pureScale environment is ideal for short transactions where there is little need to parallelize each query. Queries are automatically routed to different members, based on member workload. While this is not the typical workload in a Big Data processing scenario, Big Data environments do invest in options such as HBase and Cassandra to process short transactions.
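Workload-based routing across members can be pictured with a small Python sketch. It assumes a simple "fewest active transactions" rule and hypothetical member names. The actual routing algorithm used by pureScale is not described in this article, so treat this purely as a conceptual model.

```python
class Member:
    """One database server in the cluster, tracked by its in-flight work."""

    def __init__(self, name):
        self.name = name
        self.active = 0  # transactions currently in flight on this member

def route(members):
    """Send the next transaction to the member with the fewest active transactions."""
    target = min(members, key=lambda m: m.active)  # ties go to the earliest member
    target.active += 1
    return target

# Two members share the incoming short transactions.
members = [Member("member0"), Member("member1")]
assignments = [route(members).name for _ in range(4)]

# Scaling out: a new member joins and immediately starts taking work.
members.append(Member("member2"))
next_target = route(members).name
```

Because all members share the same data in this model, adding member2 requires no data redistribution. The router simply starts sending work to it.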

Summary
Traditional high-performance RDBMSs like DB2 have their strengths. They are very strong at maintaining data integrity and quality through constraints, foreign keys, and other validation mechanisms. They are also strong in transactional integrity, providing a superior locking model, automatic deadlock resolution, and so on. Initially, however, they were not found to adapt well to the Big Data processing needs of enterprises.

With the enhancements made by their respective vendors, databases like DB2 now offer Big Data processing features. This makes them strong candidates for enterprises that want best-of-breed features from both traditional RDBMS and Big Data processing systems while making the most of their existing investments.

More Stories By Srinivasan Sundara Rajan

Highly passionate about utilizing Digital Technologies to enable next generation enterprise. Believes in enterprise transformation through the Natives (Cloud Native & Mobile Native).
