Welcome!

Apache Authors: Liz McMillan, Pat Romanski, Elizabeth White, Christopher Harrold, Janakiram MSV

Blog Feed Post

Quick History 2: GLMs, R and large data sets

by Joseph Rickert In last week’s post, I sketched out the history of Generalized Linear Models and their implementations. In this post I’ll attempt to outline how GLM functions evolved in R to handle large data sets. The first function to make it possible to build GLM models with datasets that are too big to fit into memory was the bigglm()from Thomas Lumley’s biglm package which was released to CRAN in May 2006. bigglm()is an example of a external memory or “chunking” algorithm. This means that data is read from some source on disk and processed one chunk at a time. Conceptually, chunking algorithms work as follows: a program reads a chunk of data into memory, performs intermediate calculations to compute the required sufficient statistics, saves the results and reads the next chunk. The process continues until the entire dataset is processed. Then, if necessary, the intermediate results are assembled into a final result. According to the documentation trail, bigglm()is based on Alan Miller’s 1991 refinement (algorithm AS 274 implemented in Fortran 77) to W. Morevin Genetlemen’s 1975 Algol algorithm ( AS 75). Both of these algorithms work by updating the Cholesky decomposition of the design matrix with new observations. For a model with p variables, only the p x p triangular Cholesky factor and a new row of data need to be in memory at any given time. bigglm()does not do the chunking for you. Working with the algorithm requires figuring out how to feed it chunks of data from a file or a database that are small enough to fit into memory with enough room left for processing. ( Have a look at the make.data() function defined on page 4 of the biglm pdf for the prototype example of chunking by passing a function to bigglm()’s data argument.) bigglm() and the biglm package offer few features for working with data. For example, bigglm() can handle factors but it assumes that the factor levels are consistent across all chunks. This is very reasonable under the assumption that the appropriate place to clean and prepare the data for analysis is the underlying database. The next steps in the evolution of building GLM models with R was the development of memory-mapped data structures along with the appropriate machinery to feed bigglm() data stored on disk. In late 2007, Dan Alder et al. released the ff package which provides data structures that, from R's point of view, make data residing on disk appear is if it were in RAM. The basic idea is that only a chunk (pagesize) of the underlying data file is mapped into memory and this data can be fed to bigglm(). This strategy really became useful in 2011 when Edwin de Jonge, Jan Wijffels and Jan van der Laan released ffbase, a package of statistical functions designed to exploit ff’s data structures. ffbase contains quite a few functions including some for basic data manipulation such as ffappend() and ffmatch(). For an excellent example of building a bigglm() model with a fairly large data set have a look at the post from the folks at BNOSAC. This is one of the most useful, hands-on posts with working code for building models with R and large data sets to be found. (It may be a testimony to the power of provocation.) Not longer after ff debuted (June 2008), Michael Kane, John Emerson and Peter Haverty released bigmemory, a package for working with large matrices backed by memory-mapped files. Thereafter followed a sequence of packages in the Big Memory Project, including biganalytics, for exploiting the computational possibilities opened by by bigmemory. bigmemory packages are built on the Boost Interprocess C++ library and were designed to facilitate parallel programming with foreach, snow, Rmpi and multicore and enable distributed computing from within R. The biganalytics package contains a wrapper function for bigglm() that enables building GLM models from very large files mapped to big.matrix objects with just a few lines of code. The initial release in early August 2010 of the RevoScaleR package for Revolution R Enterprise included rxLogit(), a function for building logistic regression models on very masive data sets. rxLogit() was one of the first of RevoScaleR’s Parallel External Memory Algorithms (PEMA). These algorithms are designed specifically for high performance computing with large data sets on a variety of distributed platforms. In June 2012, Revolution Analytics followed up with rxGlm(), a PEMA that implements all of the all of the standard GLM link/family pairs as well as Tweedie models and user-defined link functions. As with all of the PEMAS, scripts including rxGlm() may be run on different platforms just by changing a few lines of code that specifies the user’s compute context. For example, a statistician could test out a model on a local PC or cluster and then change the compute context to run it directly on a Hadoop cluster. The only other Big Data GLM implementation accessible through an R package of which I am aware is h20.glm() function that is part of the 0xdata’s JVM implementation of machine learning algorithms which was announced in October 2013.  As opposed the the external memory R implementations described above, H20 functions run in the distributed memory created by the H20 process. Look here for h20.glm() demo code. And that's it, I think this brings us up to date with R based (or accessible) functions for running GLMs on large data sets.  

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid

@ThingsExpo Stories
Apache Hadoop is emerging as a distributed platform for handling large and fast incoming streams of data. Predictive maintenance, supply chain optimization, and Internet-of-Things analysis are examples where Hadoop provides the scalable storage, processing, and analytics platform to gain meaningful insights from granular data that is typically only valuable from a large-scale, aggregate view. One architecture useful for capturing and analyzing streaming data is the Lambda Architecture, represent...
My team embarked on building a data lake for our sales and marketing data to better understand customer journeys. This required building a hybrid data pipeline to connect our cloud CRM with the new Hadoop Data Lake. One challenge is that IT was not in a position to provide support until we proved value and marketing did not have the experience, so we embarked on the journey ourselves within the product marketing team for our line of business within Progress. In his session at @BigDataExpo, Sum...
Things are changing so quickly in IoT that it would take a wizard to predict which ecosystem will gain the most traction. In order for IoT to reach its potential, smart devices must be able to work together. Today, there are a slew of interoperability standards being promoted by big names to make this happen: HomeKit, Brillo and Alljoyn. In his session at @ThingsExpo, Adam Justice, vice president and general manager of Grid Connect, will review what happens when smart devices don’t work togethe...
SYS-CON Events announced today that Ocean9will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Ocean9 provides cloud services for Backup, Disaster Recovery (DRaaS) and instant Innovation, and redefines enterprise infrastructure with its cloud native subscription offerings for mission critical SAP workloads.
In his session at @ThingsExpo, Eric Lachapelle, CEO of the Professional Evaluation and Certification Board (PECB), will provide an overview of various initiatives to certifiy the security of connected devices and future trends in ensuring public trust of IoT. Eric Lachapelle is the Chief Executive Officer of the Professional Evaluation and Certification Board (PECB), an international certification body. His role is to help companies and individuals to achieve professional, accredited and worldw...
SYS-CON Events announced today that Technologic Systems Inc., an embedded systems solutions company, will exhibit at SYS-CON's @ThingsExpo, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Technologic Systems is an embedded systems company with headquarters in Fountain Hills, Arizona. They have been in business for 32 years, helping more than 8,000 OEM customers and building over a hundred COTS products that have never been discontinued. Technologic Systems’ pr...
SYS-CON Events announced today that CA Technologies has been named “Platinum Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business – from apparel to energy – is being rewritten by software. From ...
The taxi industry never saw Uber coming. Startups are a threat to incumbents like never before, and a major enabler for startups is that they are instantly “cloud ready.” If innovation moves at the pace of IT, then your company is in trouble. Why? Because your data center will not keep up with frenetic pace AWS, Microsoft and Google are rolling out new capabilities In his session at 20th Cloud Expo, Don Browning, VP of Cloud Architecture at Turner, will posit that disruption is inevitable for c...
SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a ...
Keeping pace with advancements in software delivery processes and tooling is taxing even for the most proficient organizations. Point tools, platforms, open source and the increasing adoption of private and public cloud services requires strong engineering rigor - all in the face of developer demands to use the tools of choice. As Agile has settled in as a mainstream practice, now DevOps has emerged as the next wave to improve software delivery speed and output. To make DevOps work, organization...
SYS-CON Events announced today that Loom Systems will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded in 2015, Loom Systems delivers an advanced AI solution to predict and prevent problems in the digital business. Loom stands alone in the industry as an AI analysis platform requiring no prior math knowledge from operators, leveraging the existing staff to succeed in the digital era. With offices in S...
The explosion of new web/cloud/IoT-based applications and the data they generate are transforming our world right before our eyes. In this rush to adopt these new technologies, organizations are often ignoring fundamental questions concerning who owns the data and failing to ask for permission to conduct invasive surveillance of their customers. Organizations that are not transparent about how their systems gather data telemetry without offering shared data ownership risk product rejection, regu...
SYS-CON Events announced today that Interoute, owner-operator of one of Europe's largest networks and a global cloud services platform, has been named “Bronze Sponsor” of SYS-CON's 20th Cloud Expo, which will take place on June 6-8, 2017 at the Javits Center in New York, New York. Interoute is the owner-operator of one of Europe's largest networks and a global cloud services platform which encompasses 12 data centers, 14 virtual data centers and 31 colocation centers, with connections to 195 add...
SYS-CON Events announced today that SoftLayer, an IBM Company, has been named “Gold Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. SoftLayer, an IBM Company, provides cloud infrastructure as a service from a growing number of data centers and network points of presence around the world. SoftLayer’s customers range from Web startups to global enterprises.
SYS-CON Events announced today that CrowdReviews.com has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. CrowdReviews.com is a transparent online platform for determining which products and services are the best based on the opinion of the crowd. The crowd consists of Internet users that have experienced products and services first-hand and have an interest in letting other potential buyers...
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on ...
SYS-CON Events announced today that Infranics will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Since 2000, Infranics has developed SysMaster Suite, which is required for the stable and efficient management of ICT infrastructure. The ICT management solution developed and provided by Infranics continues to add intelligence to the ICT infrastructure through the IMC (Infra Management Cycle) based on mathemat...
SYS-CON Events announced today that SD Times | BZ Media has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. BZ Media LLC is a high-tech media company that produces technical conferences and expositions, and publishes a magazine, newsletters and websites in the software development, SharePoint, mobile development and commercial UAV markets.
SYS-CON Events announced today that Telecom Reseller has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. Telecom Reseller reports on Unified Communications, UCaaS, BPaaS for enterprise and SMBs. They report extensively on both customer premises based solutions such as IP-PBX as well as cloud based and hosted platforms.
"I think that everyone recognizes that for IoT to really realize its full potential and value that it is about creating ecosystems and marketplaces and that no single vendor is able to support what is required," explained Esmeralda Swartz, VP, Marketing Enterprise and Cloud at Ericsson, in this SYS-CON.tv interview at @ThingsExpo, held June 7-9, 2016, at the Javits Center in New York City, NY.