Welcome!

Apache Authors: Pat Romanski, Liz McMillan, Elizabeth White, Christopher Harrold, Janakiram MSV

Related Topics: @CloudExpo, Linux Containers, Open Source Cloud, Apache, @BigDataExpo

@CloudExpo: Article

Apache Spark vs. Hadoop | @CloudExpo #BigData #DevOps #Microservices

A choice of job styles

If you’re running Big Data applications, you’re going to want to look at some kind of distributed processing system. Hadoop is one of the best-known clustering systems, but how are you going to process all your data in a reasonable time frame? Apache Spark offers services that go beyond a standard MapReduce cluster.

A choice of job styles
MapReduce has become a standard, perhaps
the standard, for distributed file systems. While it’s a great system already, it’s really geared toward batch use, with jobs needing to queue for later output. This can severely hamper your flexibility. What if you want to explore some of your data? If it’s going to take all night, forget about it.

With Apache Spark, you can act on your data in whatever way you want. Want to look for interesting tidbits in your data? You can perform some quick queries. Want to run something you know will take a long time? You can use a batch job. Want to process your data streams in real time? You can do that too.

The biggest advantage of modern programming languages is their use of interactive shells. Sure, Lisp did that back in the ‘60s, but it was a long time before the kind of power to program interactively became available to the average programmer. With Python and Scala you can try out your ideas in real time and develop algorithms iteratively, without the time-consuming write/compile/test/debug cycle.

RDDs
The key to Spark’s flexibility is the Resilient Distributed Datasets, or RDDs. RDDs maintain a lineage of everything that’s done to your data. They’re fine-grained, keeping track of all changes that have been made from other transformations such as
map or join. This means that it’s possible to recover from failures by rebuilding from these transformations (which is why they’re called Resilient Distributed Datasets).

RDDs also represent data in memory, which is a lot faster than always pulling data off of disks—even with SSDs making their way into data centers. While having your data in memory might seem like a recipe for slow performance, Spark uses lazy evaluation, only making transformations on data when you specifically ask for the result. This is why you can get queries so quickly even on very large datasets.

You might have recognized the term “lazy evaluation” from functional programming languages like Haskell. RDDs are only loaded when specific actions produce some kind of output; for example, printing to a text file. You can have a complex query over your data, but it won’t actually be evaluated until you ask for it. And the query might only find a specific subset of your data instead of plowing through the whole thing. This lazy evaluation lets you create complex queries on large datasets without incurring a performance penalty.

RDDs are also immutable, which leads to greater protection against data loss even though they’re in memory. In case of an error, Spark can go back to the last part of an RDD’s lineage and recover from there rather than relying on a checkpoint-based system on a disk.

Spark and Hadoop, Not as Different as You Think
Speaking of disks, you might be wondering whether Spark replaces a Hadoop cluster. That’s really a false dichotomy. Hadoop and Spark work
together. While Spark provides the processing, Hadoop handles the actual storage and resource management. After all, you can’t store data in your memory forever.

With the combination of Spark and Hadoop in the same cluster, you can cut down on a lot of overhead in maintaining different clusters. This combined cluster will give you unlimited scale for Big Data operations.

Who’s Using Spark?
When you have your Big Data cluster in place, you’ll be able to do lots of interesting things. From genome sequencing analysis, to digital advertising to a major credit card company who uses Spark to match thousands of transactions at once
for possible fraud detection. Cisco does something similar with a cloud-based security product to spot possible hacking before it turns into a major data breach. Geneticists use it to match genes to new medicines.

Conclusion
Apache Spark builds on Hadoop and then goes beyond it by adding stream processing capabilities. The MapR distribution is the only one that offers everything you need right out of the box to enable real-time data processing.

For a more in-depth view into how Spark and Hadoop benefit from each other, read chapter four of the free interactive ebook: Getting Started with Apache Spark: From Inception to Production, by James A. Scott.

More Stories By Jim Scott

Jim has held positions running Operations, Engineering, Architecture and QA teams in the Consumer Packaged Goods, Digital Advertising, Digital Mapping, Chemical and Pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop.

@ThingsExpo Stories
SYS-CON Events announced today that DXWorldExpo has been named “Global Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Digital Transformation is the key issue driving the global enterprise IT business. Digital Transformation is most prominent among Global 2000 enterprises and government institutions.
SYS-CON Events announced today that NetApp has been named “Bronze Sponsor” of SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. NetApp is the data authority for hybrid cloud. NetApp provides a full range of hybrid cloud data services that simplify management of applications and data across cloud and on-premises environments to accelerate digital transformation. Together with their partners, NetApp em...
SYS-CON Events announced today that SIGMA Corporation will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. uLaser flow inspection device from the Japanese top share to Global Standard! Then, make the best use of data to flip to next page. For more information, visit http://www.sigma-k.co.jp/en/.
SYS-CON Events announced today that N3N will exhibit at SYS-CON's @ThingsExpo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. N3N’s solutions increase the effectiveness of operations and control centers, increase the value of IoT investments, and facilitate real-time operational decision making. N3N enables operations teams with a four dimensional digital “big board” that consolidates real-time live video feeds alongside IoT sensor data a...
Real IoT production deployments running at scale are collecting sensor data from hundreds / thousands / millions of devices. The goal is to take business-critical actions on the real-time data and find insights from stored datasets. In his session at @ThingsExpo, John Walicki, Watson IoT Developer Advocate at IBM Cloud, will provide a fast-paced developer journey that follows the IoT sensor data from generation, to edge gateway, to edge analytics, to encryption, to the IBM Bluemix cloud, to Wa...
There is huge complexity in implementing a successful digital business that requires efficient on-premise and cloud back-end infrastructure, IT and Internet of Things (IoT) data, analytics, Machine Learning, Artificial Intelligence (AI) and Digital Applications. In the data center alone, there are physical and virtual infrastructures, multiple operating systems, multiple applications and new and emerging business and technological paradigms such as cloud computing and XaaS. And then there are pe...
SYS-CON Events announced today that B2Cloud will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. B2Cloud specializes in IoT devices for preventive and predictive maintenance in any kind of equipment retrieving data like Energy consumption, working time, temperature, humidity, pressure, etc.
DevOps at Cloud Expo – being held October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA – announces that its Call for Papers is open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the world's largest enterprises – and delivering real r...
SYS-CON Events announced today that Massive Networks, that helps your business operate seamlessly with fast, reliable, and secure internet and network solutions, has been named "Exhibitor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. As a premier telecommunications provider, Massive Networks is headquartered out of Louisville, Colorado. With years of experience under their belt, their team of...
SYS-CON Events announced today that Suzuki Inc. will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Suzuki Inc. is a semiconductor-related business, including sales of consuming parts, parts repair, and maintenance for semiconductor manufacturing machines, etc. It is also a health care business providing experimental research for...
SYS-CON Events announced today that Fusic will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Fusic Co. provides mocks as virtual IoT devices. You can customize mocks, and get any amount of data at any time in your test. For more information, visit https://fusic.co.jp/english/.
SYS-CON Events announced today that Ryobi Systems will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Ryobi Systems Co., Ltd., as an information service company, specialized in business support for local governments and medical industry. We are challenging to achive the precision farming with AI. For more information, visit http:...
SYS-CON Events announced today that Keisoku Research Consultant Co. will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Keisoku Research Consultant, Co. offers research and consulting in a wide range of civil engineering-related fields from information construction to preservation of cultural properties. For more information, vi...
SYS-CON Events announced today that Daiya Industry will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Daiya Industry specializes in orthotic support systems and assistive devices with pneumatic artificial muscles in order to contribute to an extended healthy life expectancy. For more information, please visit https://www.daiyak...
SYS-CON Events announced today that Interface Corporation will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Interface Corporation is a company developing, manufacturing and marketing high quality and wide variety of industrial computers and interface modules such as PCIs and PCI express. For more information, visit http://www.i...
SYS-CON Events announced today that Mobile Create USA will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Mobile Create USA Inc. is an MVNO-based business model that uses portable communication devices and cellular-based infrastructure in the development, sales, operation and mobile communications systems incorporating GPS capabi...
In his session at @ThingsExpo, Greg Gorman is the Director, IoT Developer Ecosystem, Watson IoT, will provide a short tutorial on Node-RED, a Node.js-based programming tool for wiring together hardware devices, APIs and online services in new and interesting ways. It provides a browser-based editor that makes it easy to wire together flows using a wide range of nodes in the palette that can be deployed to its runtime in a single-click. There is a large library of contributed nodes that help so...
Elon Musk is among the notable industry figures who worries about the power of AI to destroy rather than help society. Mark Zuckerberg, on the other hand, embraces all that is going on. AI is most powerful when deployed across the vast networks being built for Internets of Things in the manufacturing, transportation and logistics, retail, healthcare, government and other sectors. Is AI transforming IoT for the good or the bad? Do we need to worry about its potential destructive power? Or will we...
SYS-CON Events announced today that mruby Forum will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. mruby is the lightweight implementation of the Ruby language. We introduce mruby and the mruby IoT framework that enhances development productivity. For more information, visit http://forum.mruby.org/.
SYS-CON Events announced today that Nihon Micron will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Nihon Micron Co., Ltd. strives for technological innovation to establish high-density, high-precision processing technology for providing printed circuit board and metal mount RFID tags used for communication devices. For more inf...