Quick History 2: GLMs, R and large data sets

by Joseph Rickert
In last week’s post, I sketched out the history of Generalized Linear Models and their implementations. In this post I’ll attempt to outline how GLM functions evolved in R to handle large data sets.
The first function to make it possible to build GLM models with datasets that are too big to fit into memory was bigglm() from Thomas Lumley’s biglm package, which was released to CRAN in May 2006. bigglm() is an example of an external memory or “chunking” algorithm: data is read from some source on disk and processed one chunk at a time. Conceptually, chunking algorithms work as follows: a program reads a chunk of data into memory, performs intermediate calculations to compute the required sufficient statistics, saves the results and reads the next chunk. The process continues until the entire dataset is processed. Then, if necessary, the intermediate results are assembled into a final result.
According to the documentation trail, bigglm() is based on Alan Miller’s 1991 refinement (algorithm AS 274, implemented in Fortran 77) of W. Morven Gentleman’s 1975 Algol algorithm (AS 75). Both of these algorithms work by updating the Cholesky decomposition of the design matrix with new observations. For a model with p variables, only the p x p triangular Cholesky factor and a new row of data need to be in memory at any given time.
bigglm() does not do the chunking for you. Working with the algorithm requires figuring out how to feed it chunks of data from a file or a database that are small enough to fit into memory with enough room left for processing. (Have a look at the make.data() function defined on page 4 of the biglm pdf for the prototype example of chunking by passing a function to bigglm()’s data argument.) bigglm() and the biglm package offer few features for working with data. For example, bigglm() can handle factors, but it assumes that the factor levels are consistent across all chunks.
This is very reasonable under the assumption that the appropriate place to clean and prepare the data for analysis is the underlying database.
The next step in the evolution of building GLM models with R was the development of memory-mapped data structures along with the appropriate machinery to feed bigglm() data stored on disk. In late 2007, Dan Adler et al. released the ff package, which provides data structures that, from R's point of view, make data residing on disk appear as if it were in RAM. The basic idea is that only a chunk (pagesize) of the underlying data file is mapped into memory, and this data can be fed to bigglm(). This strategy really became useful in 2011 when Edwin de Jonge, Jan Wijffels and Jan van der Laan released ffbase, a package of statistical functions designed to exploit ff’s data structures. ffbase contains quite a few functions, including some for basic data manipulation such as ffappend() and ffmatch(). For an excellent example of building a bigglm() model with a fairly large data set, have a look at the post from the folks at BNOSAC. This is one of the most useful, hands-on posts with working code for building models with R and large data sets to be found. (It may be a testimony to the power of provocation.)
Not long after ff debuted (June 2008), Michael Kane, John Emerson and Peter Haverty released bigmemory, a package for working with large matrices backed by memory-mapped files. Thereafter followed a sequence of packages in the Big Memory Project, including biganalytics, for exploiting the computational possibilities opened by bigmemory. The bigmemory packages are built on the Boost Interprocess C++ library and were designed to facilitate parallel programming with foreach, snow, Rmpi and multicore and to enable distributed computing from within R. The biganalytics package contains a wrapper function for bigglm() that enables building GLM models from very large files mapped to big.matrix objects with just a few lines of code.
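The memory-mapping idea behind ff and bigmemory can be illustrated outside R. The following is a minimal sketch in Python, not code from either package: it writes a numeric column to a temporary file, maps the file with the standard mmap module, and reads it one window at a time. The file name column.bin and the helpers write_column() and chunk_sums() are hypothetical names chosen for this illustration.

```python
import mmap
import os
import struct
import tempfile

DOUBLE = struct.calcsize("d")  # bytes per stored value (native double)


def write_column(path, values):
    """Write a numeric column to disk as raw native doubles."""
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack("d", v))


def chunk_sums(path, chunk_len):
    """Sum a disk-backed column one window of chunk_len values at a time."""
    sums = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n = len(mm) // DOUBLE
        for start in range(0, n, chunk_len):
            count = min(chunk_len, n - start)
            # Only the bytes of this window are decoded into Python objects.
            window = struct.unpack_from(f"{count}d", mm, start * DOUBLE)
            sums.append(sum(window))
    return sums


# Hypothetical example data: the values 0.0 .. 9.0 in a temporary file.
path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_column(path, [float(i) for i in range(10)])
sums = chunk_sums(path, chunk_len=4)
print(sums)  # [6.0, 22.0, 17.0]
```

The program only ever holds one window of values as ordinary in-memory objects, while the operating system decides which parts of the file are actually resident. That is the property that lets a disk-backed structure feed a chunking algorithm such as bigglm().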
The initial release in early August 2010 of the RevoScaleR package for Revolution R Enterprise included rxLogit(), a function for building logistic regression models on massive data sets. rxLogit() was one of the first of RevoScaleR’s Parallel External Memory Algorithms (PEMAs). These algorithms are designed specifically for high performance computing with large data sets on a variety of distributed platforms. In June 2012, Revolution Analytics followed up with rxGlm(), a PEMA that implements all of the standard GLM link/family pairs as well as Tweedie models and user-defined link functions. As with all of the PEMAs, scripts including rxGlm() may be run on different platforms just by changing a few lines of code that specify the user’s compute context. For example, a statistician could test out a model on a local PC or cluster and then change the compute context to run it directly on a Hadoop cluster.
The only other Big Data GLM implementation accessible through an R package of which I am aware is the h2o.glm() function that is part of 0xdata’s JVM implementation of machine learning algorithms, which was announced in October 2013. As opposed to the external memory R implementations described above, H2O functions run in the distributed memory created by the H2O process. Look here for h2o.glm() demo code.
And that's it. I think this brings us up to date with R-based (or R-accessible) functions for running GLMs on large data sets.
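The chunking scheme described earlier for bigglm() (read a chunk, update the sufficient statistics, move on, assemble at the end) can be sketched in a few lines. This is a minimal illustration in Python, not biglm's code. It makes two simplifying assumptions: it fits ordinary least squares, not a full GLM, and it accumulates the cross-products X'X and X'y and solves the normal equations once at the end, where the post describes the factor-updating approach of AS 274. All function names here are hypothetical.

```python
def chunks(rows, size):
    """Yield successive lists of at most `size` rows from an iterable."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def accumulate(xtx, xty, chunk):
    """Add one chunk's contribution to X'X and X'y, in place."""
    for x, y in chunk:
        for i, xi in enumerate(x):
            xty[i] += xi * y
            for j, xj in enumerate(x):
                xtx[i][j] += xi * xj


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def chunked_least_squares(rows, p, chunk_size=1000):
    """Fit y ~ x using only a p x p matrix, a p-vector and one chunk in memory."""
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for chunk in chunks(rows, chunk_size):
        accumulate(xtx, xty, chunk)
    return solve(xtx, xty)


def make_rows(n):
    """Hypothetical data source: yields ((intercept, x1), y) with y = 2 + 3 * x1."""
    for k in range(n):
        x1 = float(k % 7)
        yield (1.0, x1), 2.0 + 3.0 * x1


beta = chunked_least_squares(make_rows(50), p=2, chunk_size=8)
print(beta)  # approximately [2.0, 3.0]
```

Whatever the number of rows, the state carried between chunks is a p x p matrix and a p-vector, which is why the size of the data on disk does not limit the fit. A GLM would repeat passes like this one inside an iteratively reweighted least squares loop.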


More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.

David Smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on Twitter at @RevoDavid
