Welcome!

Apache Authors: Pat Romanski, Liz McMillan, Elizabeth White, John Mertic, Janakiram MSV

Blog Feed Post

Preparing Big Data for Analysis in R

by Yaniv Mor, Co-founder & CEO of Xplenty How do you get Big Data ready for R? Gigabytes or terabytes of raw data may need to be combined, cleaned, and aggregated before they can be analyzed. Processing such large amounts of data used to require installing Hadoop on a cluster of servers, not to mention coding MapReduce jobs in Pig or Java. Those days are over. This post is going to show how raw data can be prepared for analysis in R without any code or server installations. Instead, we’ll use Xplenty’s data integration-as-a-service to design a data flow, create a cluster, and run the job all via a friendly user interface. For this demo we’ll use 1.5 GB of raw web logs (uncompressed) from the servers that hosted the ”Star Wars Kid” video. A remix of the video was also hosted there as well as the usual affair of HTMLs, images, and more. Here’s an example log line: 208.63.63.94 - - [11/Apr/2003:12:36:39 -0700] "GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1" 200 28361 "http://www.kottke.org/" Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)" Log line format: Source IP/domain User Identifier (blank) UserID (blank) Date - in the format of dd/MMM/yyyy:HH:mm:ss Z HTTP request - type, URL, HTTP version HTTP code Bytes transferred Referrer User agent Let’s say we would only like to analyze requests to the original “Star Wars Kid” video by source IP, date and referrer. Imagine what it would be like to setup the servers and write the code - the hours spent writing and debugging a relatively simple dataflow. Feel the stress building? Let it go. Here’s how such a dataflow looks like in Xplenty: Let’s take a closer look how it works: Source - loads the data from Amazon S3 and splits it into fields. The data is publicly available on S3 at xplenty.public/weblogs/star_wars_kid.log.gz. If you’d like to take a look at the data, download it via the web, or create an AWS account and use a tool such as S3Browser to access the above path. Select - only keeps the ip, date, url, and referrer fields while leaving the rest of the data out. Note that the date also contains the time, and that the request also contains the request type and HTTP version. They are both cleaned in the select component using a regular expression. Filter - matches Star_Wars_Kid.wmv in the URL field and removes any other log lines. Destination - stores the results back into Amazon S3.  No setup or installation is needed. Just a few clicks enables you to create a new cluster. Then, one more screen to get the job running. The results - about 120 MB (uncompressed) log lines of video file requests with IPs, URLs, and referrers that are now ready for analysis. Job running time - about 3 minutes. The full results are available in the xplenty.dumpster bucket at starwarskid/videos.gz. Here are a few sample lines: 66.142.89.235 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/ 63.195.36.218 09/May/2003 /random/video/Star_Wars_Kid.wmv - 66.27.235.199 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.kuro5hin.org/story/2003/5/2/16116/46048 24.81.67.79 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/archive/2003/04/29/star_war.shtml 12.149.141.14 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/ Now, we can finally analyze the data in R. Here’s sample code which generates a traffic graph by date for Star_Wars_Kid.wmv: df <- read.table('star-wars-kid.tsv', fill = TRUE) colnames(df) >

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid

@ThingsExpo Stories
@ThingsExpo has been named the ‘Top WebRTC Influencer' by iTrend. iTrend processes millions of conversations, tweets, interactions, news articles, press releases, blog posts - and extract meaning form them and analyzes mobile and desktop software platforms used to communicate, various metadata (such as geo location), and automation tools. In overall placement, @ThingsExpo ranked as the number one ‘WebRTC Influencer' followed by @DevOpsSummit at 55th.
"There's a growing demand from users for things to be faster. When you think about all the transactions or interactions users will have with your product and everything that is between those transactions and interactions - what drives us at Catchpoint Systems is the idea to measure that and to analyze it," explained Leo Vasiliou, Director of Web Performance Engineering at Catchpoint Systems, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York Ci...
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
SYS-CON Events announced today that Linux Academy, the foremost online Linux and cloud training platform and community, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Linux Academy was founded on the belief that providing high-quality, in-depth training should be available at an affordable price. Industry leaders in quality training, provided services, and student certification passes, its goal is to c...
20th Cloud Expo, taking place June 6-8, 2017, at the Javits Center in New York City, NY, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy.
In the next five to ten years, millions, if not billions of things will become smarter. This smartness goes beyond connected things in our homes like the fridge, thermostat and fancy lighting, and into heavily regulated industries including aerospace, pharmaceutical/medical devices and energy. “Smartness” will embed itself within individual products that are part of our daily lives. We will engage with smart products - learning from them, informing them, and communicating with them. Smart produc...
"What is the next step in the evolution of IoT systems? The answer is data, information, which is a radical shift from assets, from things to input for decision making," stated Michael Minkevich, VP of Technology Services at Luxoft, in this SYS-CON.tv interview at @ThingsExpo, held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA.
The emerging Internet of Everything creates tremendous new opportunities for customer engagement and business model innovation. However, enterprises must overcome a number of critical challenges to bring these new solutions to market. In his session at @ThingsExpo, Michael Martin, CTO/CIO at nfrastructure, outlined these key challenges and recommended approaches for overcoming them to achieve speed and agility in the design, development and implementation of Internet of Everything solutions with...
WebRTC sits at the intersection between VoIP and the Web. As such, it poses some interesting challenges for those developing services on top of it, but also for those who need to test and monitor these services. In his session at WebRTC Summit, Tsahi Levent-Levi, co-founder of testRTC, reviewed the various challenges posed by WebRTC when it comes to testing and monitoring and on ways to overcome them.
Internet of @ThingsExpo, taking place June 6-8, 2017 at the Javits Center in New York City, New York, is co-located with the 20th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. @ThingsExpo New York Call for Papers is now open.
Smart Cities are here to stay, but for their promise to be delivered, the data they produce must not be put in new siloes. In his session at @ThingsExpo, Mathias Herberts, Co-founder and CTO of Cityzen Data, discussed the best practices that will ensure a successful smart city journey.
Every successful software product evolves from an idea to an enterprise system. Notably, the same way is passed by the product owner's company. In his session at 20th Cloud Expo, Oleg Lola, CEO of MobiDev, will provide a generalized overview of the evolution of a software product, the product owner, the needs that arise at various stages of this process, and the value brought by a software development partner to the product owner as a response to these needs.
In 2014, Amazon announced a new form of compute called Lambda. We didn't know it at the time, but this represented a fundamental shift in what we expect from cloud computing. Now, all of the major cloud computing vendors want to take part in this disruptive technology. In his session at 20th Cloud Expo, John Jelinek IV, a web developer at Linux Academy, will discuss why major players like AWS, Microsoft Azure, IBM Bluemix, and Google Cloud Platform are all trying to sidestep VMs and containers...
SYS-CON Events announced today that MobiDev, a client-oriented software development company, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. MobiDev is a software company that develops and delivers turn-key mobile apps, websites, web services, and complex softw...
The cloud market growth today is largely in public clouds. While there is a lot of spend in IT departments in virtualization, these aren’t yet translating into a true “cloud” experience within the enterprise. What is stopping the growth of the “private cloud” market? In his general session at 18th Cloud Expo, Nara Rajagopalan, CEO of Accelerite, explored the challenges in deploying, managing, and getting adoption for a private cloud within an enterprise. What are the key differences between wh...
"Tintri was started in 2008 with the express purpose of building a storage appliance that is ideal for virtualized environments. We support a lot of different hypervisor platforms from VMware to OpenStack to Hyper-V," explained Dan Florea, Director of Product Management at Tintri, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
The security needs of IoT environments require a strong, proven approach to maintain security, trust and privacy in their ecosystem. Assurance and protection of device identity, secure data encryption and authentication are the key security challenges organizations are trying to address when integrating IoT devices. This holds true for IoT applications in a wide range of industries, for example, healthcare, consumer devices, and manufacturing. In his session at @ThingsExpo, Lancen LaChance, vic...
WebRTC has had a real tough three or four years, and so have those working with it. Only a few short years ago, the development world were excited about WebRTC and proclaiming how awesome it was. You might have played with the technology a couple of years ago, only to find the extra infrastructure requirements were painful to implement and poorly documented. This probably left a bitter taste in your mouth, especially when things went wrong.
Big Data, cloud, analytics, contextual information, wearable tech, sensors, mobility, and WebRTC: together, these advances have created a perfect storm of technologies that are disrupting and transforming classic communications models and ecosystems. In his session at @ThingsExpo, Erik Perotti, Senior Manager of New Ventures on Plantronics’ Innovation team, provided an overview of this technological shift, including associated business and consumer communications impacts, and opportunities it m...
You have great SaaS business app ideas. You want to turn your idea quickly into a functional and engaging proof of concept. You need to be able to modify it to meet customers' needs, and you need to deliver a complete and secure SaaS application. How could you achieve all the above and yet avoid unforeseen IT requirements that add unnecessary cost and complexity? You also want your app to be responsive in any device at any time. In his session at 19th Cloud Expo, Mark Allen, General Manager of...