Book review: "Doing Data Science" by Rachel Schutt and Cathy O'Neil

by Joseph Rickert
Every once in a while a single book comes to crystallize a new discipline. If books still have this power in the era of electronic media, "Doing Data Science: Straight Talk from the Frontline" by Rachel Schutt and Cathy O’Neil (O'Reilly, 2013) might just be the book that defines data science.
"Doing Data Science", which is based on a course that Rachel taught at Columbia University and to which Cathy contributed, is ambitious and multidimensional. It presents data science in all of its messiness as an open-ended practice that is coalescing around an expanding class of problems: problems that are yielding to an interdisciplinary approach drawing on ideas and techniques from statistics, computer science, machine learning, social science and other disciplines.
The book is neither a statistics nor a machine learning text, but there are plenty of examples of statistical models and machine learning algorithms. There is enough R code in the text to get a beginner started on real problems with tools that are immediately useful. There is Python code, a bash shell script, mention of JSON and a down-to-earth discussion of Hadoop and MapReduce that many should find valuable. My favorite code example is the bash script (p. 105) that fetches an Enron spam file and performs some basic word count calculations. Its almost casual insertion into the text, without fanfare and with little explanation, provides a low-key example of the kinds of baseline IT/programmer skills that a newly minted statistician must acquire in order to work effectively as a data scientist.
"Doing Data Science" is fairly well balanced in its fusion of the statistics and machine learning world views, but Rachel’s underlying bias as a PhD statistician comes through when it counts. The grounding in linear models and the inclusion of time series models establish the required inferential skills.
The discussion of causality shows how statistical inference is essential to obtaining a deep understanding of how things really work, and the chapter on epidemiology provides a glimpse into just how deep and difficult are the problems that statisticians have been wrestling with for generations. (I found the inclusion of this chapter in a data science book to be a delightful surprise.)
It is not only the selection of material, however, that betrays the book's statistical bias. When the authors take on the big questions their language indicates a statistical mindset. For example, in the discussion following "In what sense does data science deserve the word “science” in its name?" (p. 114) the authors write: “Every design choice you make can be formulated as an hypothesis, against which you will use rigorous testing and experimentation to validate or refute”. This is the language of a Neyman/Pearson trained statistician trying to pin down the truth.
It stands in stark contrast with the machine learning viewpoint espoused in a quote by Kaggle’s Jeremy Howard who, when asked “Can you see any downside to the data-driven, black-box approach that dominates on Kaggle?”, replies:
“Some people take the view that you don’t end up with a richer understanding of the problem. But that’s just not true: The algorithms tell you what’s important and what’s not. You might ask why those things are important, but I think that’s less interesting. You end up with a predictive model that works. There is not too much to argue about there.”
So, whether you are doing science or not might just be in your intentions and point of view. Schutt and O’Neil do a marvelous job of exploring the tension between the quest for understanding and the blunt success of just getting something that works.
An unusual aspect of the book is its attempt to understand data science as a cultural phenomenon and to place the technology in a historical and social context.
Most textbooks in mathematics, statistics and science make no mention of how things came to be. Their authors are just under too much pressure to get on with presenting the material to stop and discuss “just what were those guys thinking?”. But Schutt and O’Neil take the time, and the book is richer for it. Mike Driscoll and Drew Conway, two practitioners who early on recognized that data science is something new, are quoted along with other contemporary data scientists who are shaping the discipline both through their work and how they talk about it.
A great strength of the book is its collection of real-world, big-league examples contributed by the guest lecturers to Rachel’s course. Doug Perlson of Real Direct, Jake Hofman of Microsoft Research, Brian Dalessandro and Claudia Perlich both of Media6Degrees, Kyle Teague of GetGlue, William Cukierski of Kaggle, David Huffaker of Google, Matt Gattis of, Mark Hansen of Columbia University, Ian Wong of Square, John Kelley of Morningside Analytics and David Madigan, Chair of Columbia’s Statistics Department, all bring thoughtful presentations of difficult problems with which they have struggled. The perspective and insight of these practicing data scientists and statisticians are invaluable. Claudia Perlich’s discussion of data leakage alone is probably worth the price of the book.
A minor fault of the book is the occasional lapse into the hip vulgar. Someone being “pissed off” and talking about a model “that would totally suck” are probably innocuous enough phrases, but describing a vector as “huge ass” doesn’t really contribute to clarity. In a book that stresses communication, language counts.
Nevertheless, "Doing Data Science" is a really “good read”. The authors have done a remarkable job of integrating class notes, their respective blogs, and the presentations of the guest speakers into a single, engaging voice that mostly speaks clearly to the reader. I think this book will appeal to a wide audience.
Beginners asking the question “How do I get into data science?” will find the book to be a guide that will take them a long way. Accomplished data scientists will find a perspective on their profession that they should appreciate as being both provocative and valuable. "Doing Data Science" argues eloquently for a technology that respects humanist ideals and ethical considerations. We should all be asking "What problems should I be working on?", "Am I doing science or not?", and "What are the social and ethical implications of my work?". Finally, technical managers charged with assembling a data science team, and other interested outsiders, should find the book helpful in getting beyond the hype and having a look at what it really takes to squeeze insight from data.


More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire), overseeing the product management of S-PLUS and other statistical and data mining products.

David Smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on Twitter at @RevoDavid.
