Apache Authors: Kevin Benedict, Noel Wurst, David Smith, RealWire News Distribution, Max Katz

Blog Feed Post

A new paper and new(ish) R packages

By Jay Emerson and Mike Kane We’re very happy to announce our recent publication with Steve Weston in the Journal of Statistical Software (JSS), “Scalable Strategies for Computing with Massive Data”, JSS Volume 55 Issue 14. In a nutshell: This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware. We also welcome Pete Haverty from Genentech as an author on the bigmemory package. Pete and his colleagues at Genentech have made some substantial improvements to the package and are some of the heaviest users of these extensions (at least, to the best of our knowledge).  Secondly, we’d like to announce a new package, BH, with lead author and maintainer Dirk Eddelbuettel (he of Rcpp fame, also the first user and constructive critic of bigmemory).  BH contains a subset of Boost headers used by bigmemory and other packages, some in active development and not yet on CRAN: Boost provides free peer-reviewed portable C++ source libraries. A large part of Boost is provided as C++ template code, which is resolved entirely at compile-time without linking. This package aims to provide the most useful subset of Boost libraries for template use among CRAN package. By placing these libraries in this package, we offer a more efficient distribution system for CRAN as replication of this code in the sources of other packages is avoided. New libraries from Boost may be included upon request (though we limit it to headers only, with no compiled code).  Please visit our new Github site for more information.  Finally, we’d like to call attention to a change in JSS software license policy.  With the publication of “Scalable Strategies for Computing with Massive Data” JSS now accepts software licensed under either GPL-2 or GPL-3.  GPL-3 in turn is compatible with Apache-2.0, and all these licenses are compatible with Boost’s very permissive BSL-1.0 license.  This should help to broaden the software contributions documented and reviewed in JSS, and we are grateful to the Editors of JSS for this shift in policy.

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid