


Book review: "Doing Data Science" by Rachel Schutt and Cathy O'Neil

by Joseph Rickert

Every once in a while a single book comes to crystallize a new discipline. If books still have this power in the era of electronic media, "Doing Data Science: Straight Talk from the Frontline" by Rachel Schutt and Cathy O’Neil (O'Reilly, 2013) might just be the book that defines data science. "Doing Data Science", which is based on a course that Rachel taught at Columbia University and to which Cathy contributed, is ambitious and multidimensional. It presents data science in all of its messiness as an open-ended practice that is coalescing around an expanding class of problems: problems that are yielding to an interdisciplinary approach that includes ideas and techniques from statistics, computer science, machine learning, social science and other disciplines.

The book is neither a statistics nor a machine learning text, but there are plenty of examples of statistical models and machine learning algorithms. There is enough R code in the text to get a beginner started on real problems with tools that are immediately useful. There is Python code, a bash shell script, mention of JSON and a down-to-earth discussion of Hadoop and MapReduce that many should find valuable. My favorite code example is the bash script (p. 105) that fetches an Enron spam file and performs some basic word count calculations. Its almost casual insertion into the text, without fanfare and with little explanation, provides a low-key example of the kinds of baseline IT/programmer skills that a newly minted statistician must acquire in order to work effectively as a data scientist.

"Doing Data Science" is fairly well balanced in its fusion of the statistics and machine learning world views, but Rachel’s underlying bias as a PhD statistician comes through when it counts. The grounding in linear models and the inclusion of time series models establish the required inferential skills. The discussion of causality shows how statistical inference is essential to obtaining a deep understanding of how things really work, and the chapter on epidemiology provides a glimpse into just how deep and difficult are the problems that statisticians have been wrestling with for generations. (I found the inclusion of this chapter in a data science book to be a delightful surprise.)

It is not only the selection of material, however, that betrays the book's statistical bias. When the authors take on the big questions, their language indicates a statistical mindset. For example, in the discussion following "In what sense does data science deserve the word “science” in its name?" (p. 114) the authors write: “Every design choice you make can be formulated as an hypothesis, against which you will use rigorous testing and experimentation to validate or refute”. This is the language of a Neyman/Pearson trained statistician trying to pin down the truth. It stands in stark contrast with the machine learning viewpoint espoused in a quote by Kaggle’s Jeremy Howard who, when asked “Can you see any downside to the data-driven, black-box approach that dominates on Kaggle?”, replies:

“Some people take the view that you don’t end up with a richer understanding of the problem. But that’s just not true: The algorithms tell you what’s important and what’s not. You might ask why those things are important, but I think that’s less interesting. You end up with a predictive model that works. There is not too much to argue about there.”

So, whether you are doing science or not might just be in your intentions and point of view. Schutt and O’Neil do a marvelous job of exploring the tension between the quest for understanding and the blunt success of just getting something that works.

An unusual aspect of the book is its attempt to understand data science as a cultural phenomenon and to place the technology in a historical and social context. Most textbooks in mathematics, statistics and science make no mention of how things came to be. Their authors are just under too much pressure to get on with presenting the material to stop and discuss “just what were those guys thinking?”. But Schutt and O’Neil take the time, and the book is richer for it. Mike Driscoll and Drew Conway, two practitioners who early on recognized that data science is something new, are quoted along with other contemporary data scientists who are shaping the discipline both through their work and how they talk about it.

A great strength of the book is its collection of real-world, big-league examples contributed by the guest lecturers to Rachel’s course. Doug Perlson of Real Direct, Jake Hofman of Microsoft Research, Brian Dalessandro and Claudia Perlich both of Media6Degrees, Kyle Teague of GetGlue, William Cukierski of Kaggle, David Huffaker of Google, Matt Gattis of Hutch.com, Mark Hansen of Columbia University, Ian Wong of Square, John Kelley of Morningside Analytics and David Madigan, Chair of Columbia’s Statistics Department, all bring thoughtful presentations of difficult problems with which they have struggled. The perspective and insight of these practicing data scientists and statisticians are invaluable. Claudia Perlich’s discussion of data leakage alone is probably worth the price of the book.

A minor fault of the book is the occasional lapse into the hip vulgar. Someone being “pissed off” and talking about a model “that would totally suck” are probably innocuous enough phrases, but describing a vector as “huge ass” doesn’t really contribute to clarity. In a book that stresses communication, language counts. Nevertheless, "Doing Data Science" is a really “good read”. The authors have done a remarkable job of integrating class notes, their respective blogs, and the presentations of the guest speakers into a single, engaging voice that mostly speaks clearly to the reader.

I think this book will appeal to a wide audience. Beginners asking the question “How do I get into data science?” will find the book to be a guide that will take them a long way. Accomplished data scientists will find a perspective on their profession that they should appreciate as being both provocative and valuable. "Doing Data Science" argues eloquently for a technology that respects humanist ideals and ethical considerations. We should all be asking "What problems should I be working on?", "Am I doing science or not?", and "What are the social and ethical implications of my work?". Finally, technical managers charged with assembling a data science team, and other interested outsiders, should find the book helpful in getting beyond the hype and having a look at what it really takes to squeeze insight from data.


More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire), overseeing the product management of S-PLUS and other statistical and data mining products.

David Smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on Twitter at @RevoDavid.
