Book review: "Doing Data Science" by Rachel Schutt and Cathy O'Neil

by Joseph Rickert

Every once in a while a single book comes to crystallize a new discipline. If books still have this power in the era of electronic media, "Doing Data Science: Straight Talk from the Frontline" by Rachel Schutt and Cathy O’Neil (O'Reilly, 2013) might just be the book that defines data science. "Doing Data Science", which is based on a course that Rachel taught at Columbia University and to which Cathy contributed, is ambitious and multidimensional. It presents data science in all of its messiness as an open-ended practice that is coalescing around an expanding class of problems: problems that are yielding to an interdisciplinary approach that includes ideas and techniques from statistics, computer science, machine learning, social science and other disciplines.

The book is neither a statistics nor a machine learning text, but there are plenty of examples of statistical models and machine learning algorithms. There is enough R code in the text to get a beginner started on real problems with tools that are immediately useful. There is Python code, a bash shell script, mention of JSON and a down-to-earth discussion of Hadoop and MapReduce that many should find valuable. My favorite code example is the bash script (p. 105) that fetches an Enron spam file and performs some basic word count calculations. Its almost casual insertion into the text, without fanfare and with little explanation, provides a low-key example of the kinds of baseline IT/programmer skills that a newly minted statistician must acquire in order to work effectively as a data scientist.

"Doing Data Science" is fairly well balanced in its fusion of the statistics and machine learning world views, but Rachel’s underlying bias as a PhD statistician comes through when it counts. The grounding in linear models and the inclusion of time series models establish the required inferential skills.
The discussion of causality shows how statistical inference is essential to obtaining a deep understanding of how things really work, and the chapter on epidemiology provides a glimpse into just how deep and difficult are the problems that statisticians have been wrestling with for generations. (I found the inclusion of this chapter in a data science book to be a delightful surprise.)

It is not only the selection of material, however, that betrays the book's statistical bias. When the authors take on the big questions, their language indicates a statistical mindset. For example, in the discussion following "In what sense does data science deserve the word “science” in its name?" (p. 114) the authors write: “Every design choice you make can be formulated as an hypothesis, against which you will use rigorous testing and experimentation to validate or refute”. This is the language of a Neyman/Pearson trained statistician trying to pin down the truth. It stands in stark contrast with the machine learning viewpoint espoused in a quote by Kaggle’s Jeremy Howard who, when asked “Can you see any downside to the data-driven, black-box approach that dominates on Kaggle?”, replies:

“Some people take the view that you don’t end up with a richer understanding of the problem. But that’s just not true: The algorithms tell you what’s important and what’s not. You might ask why those things are important, but I think that’s less interesting. You end up with a predictive model that works. There is not too much to argue about there.”

So, whether you are doing science or not might just be in your intentions and point of view. Schutt and O’Neil do a marvelous job of exploring the tension between the quest for understanding and the blunt success of just getting something that works.

An unusual aspect of the book is its attempt to understand data science as a cultural phenomenon and to place the technology in a historical and social context.
Most textbooks in mathematics, statistics and science make no mention of how things came to be. Their authors are just under too much pressure to get on with presenting the material to stop and discuss “just what were those guys thinking?”. But Schutt and O’Neil take the time, and the book is richer for it. Mike Driscoll and Drew Conway, two practitioners who early on recognized that data science is something new, are quoted along with other contemporary data scientists who are shaping the discipline both through their work and how they talk about it.

A great strength of the book is its collection of real-world, big-league examples contributed by the guest lecturers to Rachel’s course. Doug Perlson of Real Direct, Jake Hofman of Microsoft Research, Brian Dalessandro and Claudia Perlich both of Media6Degrees, Kyle Teague of GetGlue, William Cukierski of Kaggle, David Huffaker of Google, Matt Gattis of Hutch.com, Mark Hansen of Columbia University, Ian Wong of Square, John Kelley of Morningside Analytics and David Madigan, Chair of Columbia’s Statistics Department, all bring thoughtful presentations of difficult problems with which they have struggled. The perspective and insight of these practicing data scientists and statisticians are invaluable. Claudia Perlich’s discussion of data leakage alone is probably worth the price of the book.

A minor fault of the book is the occasional lapse into the hip vulgar. Someone being “pissed off” and talking about a model “that would totally suck” are probably innocuous enough phrases, but describing a vector as “huge ass” doesn’t really contribute to clarity. In a book that stresses communication, language counts.

Nevertheless, "Doing Data Science" is a really “good read”. The authors have done a remarkable job of integrating class notes, their respective blogs, and the presentations of the guest speakers into a single, engaging voice that mostly speaks clearly to the reader.
I think this book will appeal to a wide audience. Beginners asking the question “How do I get into data science?” will find the book to be a guide that will take them a long way. Accomplished data scientists will find a perspective on their profession that they should appreciate as being both provocative and valuable. "Doing Data Science" argues eloquently for a technology that respects humanist ideals and ethical considerations. We should all be asking "What problems should I be working on?", "Am I doing science or not?", and "What are the social and ethical implications of my work?". Finally, technical managers charged with assembling a data science team, and other interested outsiders, should find the book helpful in getting beyond the hype and having a look at what it really takes to squeeze insight from data.



David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire), overseeing the product management of S-PLUS and other statistical and data mining products.

David Smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on Twitter at @RevoDavid.
