Welcome!

Apache Authors: Pat Romanski, Liz McMillan, Elizabeth White, Christopher Harrold, Janakiram MSV

Blog Feed Post

Quick MapReduce with beanstalkd

At ProjectLocker, we operate a polyglot environment with a heavy Ruby bias. While we love Ruby and Rails, one of the drawbacks of Ruby is its Global VM Lock. In a nutshell, the Global VM Lock makes it harder to write Ruby code that can fully utilize a modern multi-core server. For Web applications, this isn’t a problem because the web server manages multiple processes for you (e.g. via Passenger). However, for offline processes, parallelism doesn’t come for free.

I was recently working on a project that involved the offline batch processing of lots of data. This project has been operating successfully for some time, but the data set has grown, causing the process to need more more time to complete than we’d like. So I dove in to see what we could do to speed it up. Fortunately, the process was still single-threaded, so we knew we’d be able to inject concurrency to increase throughput without adding hardware.

The job in question runs on a fairly well-equipped server, but the server was underutilized due to the process being serial. Here’s an outline of the initial code:

def main_job
  for retrieve_giant_dataset().each do |item|
    long_process(item)
  end

  summarize_results(retrieve_all_results()) 
end

def long_process(item)
  # Do some work on item that uses a lot of CPU time.
  item.save
end

That approach gets the job done, but I wanted to parallelize it. Conceptually, I wanted to transform the main_job method so that it looked something like this:

def main_job
  threads = []
  for retrieve_giant_dataset().each do |item|
    threads << Thread.new(item) do
      long_process(item)
    end
  end

  threads.each { |t| t.join }

  summarize_results(retrieve_all_results()) 
end

Unfortunately, it’s not that easy due to the aforementioned Global VM Lock. What I needed was a way to get my threads running on a bunch of independent processes. This is a problem tailor-made for a job queueing system. Enter beanstalkd, a simple & fast work queue. We paired beanstalkd with Stalker, a DSL that makes it easy to queue and process jobs from Ruby. Integrating these two was a cinch. Here’s what the restructured code looks like now:

def main_job
  for retrieve_giant_dataset().each do |item|
    Stalker.enqueue(JOB_NAME, :id => item.id)
  end

  beaneater = Beaneater::Pool.new(['localhost:11300'])
  tube = beaneater.tubes.find TUBE_NAME
  while tube.peek(:ready)
    sleep(5)
  end

  summarize_results(retrieve_all_results()) 
end

So instead of processing each item during the loop, now we just add each to the beanstalkd queue. Once we finish queueing all of the items, we wait until all of our entries have been processed by the worker processes. The workers are initiated via a jobs.rb file that looks something like this:

include Stalker
  
job JOB_NAME do |args|
  item = ItemClass.find(args['id'])
  Worker::long_process(item) 
end

We then start beanstalkd and a few worker processes and we’re off to the races. Now our job runs in parallel via multiple processes, and we can tune the number of worker processes we run to consume as much of the machine’s resources as we like. As a bonus, we can also run Stalker workers on other machines in our cluster for added parallelism. With just a few minor tweaks to our code, we’ve gone from single-threaded to a solution that is limited only by the capacity of the shared database used. Sweet!

What about the MapReduce reference in the title of this post? The MapReduce algorithm basically has two steps. In the Map step, you divide the work and assign it to worker nodes. The Reduce step simply combines the results of each individual node’s computation into an aggregate result. In our solution here, the Map step is done by us enqueuing our jobs into beanstalkd and then beanstalkd making the jobs available for consumption by our nodes. Our database serves to communicate the details of the jobs, and stands in for a shared filesystem like the HDFS used by Hadoop. I didn’t go into detail about this step, but our Reduce is also assisted by database aggregates; we’re able to construct a few simple queries that get us what we want from the database.

So there it is, distributed MapReduce for Ruby using beanstalkd, Stalker, and a healthy database. This is probably not the best solution if you need to scale to thousands or tens of thousands of workers. But if you just need to get tens of workers running in parallel quickly, you may be able to adapt this approach to fit your needs.

Read the original blog entry...

More Stories By Damon Young

Damon Young is Director of Sales at ProjectLocker.com. ProjectLocker was founded in 2003 to provide on-demand tools for software developers. Guided by the simple mission of helping companies build better software, ProjectLocker's services have expanded to include services for the complete lifecycle of software projects, from requirements documentation to build and test automation. ProjectLocker serves companies from startups to Fortune 1000 multinationals.

@ThingsExpo Stories
Existing Big Data solutions are mainly focused on the discovery and analysis of data. The solutions are scalable and highly available but tedious when swapping in and swapping out occurs in disarray and thrashing takes place. The resolution for thrashing through machine learning algorithms and support nomenclature is through simple techniques. Organizations that have been collecting large customer data are increasingly seeing the need to use the data for swapping in and out and thrashing occurs ...
DevOps at Cloud Expo – being held October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA – announces that its Call for Papers is open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the world's largest enterprises – and delivering real r...
Detecting internal user threats in the Big Data eco-system is challenging and cumbersome. Many organizations monitor internal usage of the Big Data eco-system using a set of alerts. This is not a scalable process given the increase in the number of alerts with the accelerating growth in data volume and user base. Organizations are increasingly leveraging machine learning to monitor only those data elements that are sensitive and critical, autonomously establish monitoring policies, and to detect...
SYS-CON Events announced today that CollabNet, a global leader in enterprise software development, release automation and DevOps solutions, will be a Bronze Sponsor of SYS-CON's 20th International Cloud Expo®, taking place from June 6-8, 2017, at the Javits Center in New York City, NY. CollabNet offers a broad range of solutions with the mission of helping modern organizations deliver quality software at speed. The company’s latest innovation, the DevOps Lifecycle Manager (DLM), supports Value S...
SYS-CON Events announced today that A&I Solutions named "Bronze Sponsor" of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded over 15 years ago in 1999, A&I Solutions continues to provide companies with premier integrated enterprise solutions. By partnering with the trusted and proven solutions of leading technology companies, our customers are assured high performance levels across all IT environments including:...
SYS-CON Events announced today that Progress, a global leader in application development, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Enterprises today are rapidly adopting the cloud, while continuing to retain business-critical/sensitive data inside the firewall. This is creating two separate data silos – one inside the firewall and the other outside the firewall. Cloud ISVs ofte...
SYS-CON Events announced today that Loom Systems will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded in 2015, Loom Systems delivers an advanced AI solution to predict and prevent problems in the digital business. Loom stands alone in the industry as an AI analysis platform requiring no prior math knowledge from operators, leveraging the existing staff to succeed in the digital era. With offices in S...
SYS-CON Events announced today that Enzu will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Enzu’s mission is to be the leading provider of enterprise cloud solutions worldwide. Enzu enables online businesses to use its IT infrastructure to their competitive ad...
The 21st International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Digital Transformation, Machine Learning and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding busin...
DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm.
SYS-CON Events announced today that delaPlex will exhibit at SYS-CON's @CloudExpo, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. delaPlex pioneered Software Development as a Service (SDaaS), which provides scalable resources to build, test, and deploy software. It’s a fast and more reliable way to develop a new product or expand your in-house team.
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on ...
Five years ago development was seen as a dead-end career, now it’s anything but – with an explosion in mobile and IoT initiatives increasing the demand for skilled engineers. But apart from having a ready supply of great coders, what constitutes true ‘DevOps Royalty’? It’ll be the ability to craft resilient architectures, supportability, security everywhere across the software lifecycle. In his keynote at @DevOpsSummit at 20th Cloud Expo, Jeffrey Scheaffer, GM and SVP, Continuous Delivery Busine...
SYS-CON Events announced today that SoftLayer, an IBM Company, has been named “Gold Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. SoftLayer, an IBM Company, provides cloud infrastructure as a service from a growing number of data centers and network points of presence around the world. SoftLayer’s customers range from Web startups to global enterprises.
SYS-CON Events announced today that Peak 10, Inc., a national IT infrastructure and cloud services provider, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Peak 10 provides reliable, tailored data center and network services, cloud and managed services. Its solutions are designed to scale and adapt to customers’ changing business needs, enabling them to lower costs, improve performance and focus intern...
SYS-CON Events announced today that Linux Academy, the foremost online Linux and cloud training platform and community, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Linux Academy was founded on the belief that providing high-quality, in-depth training should be available at an affordable price. Industry leaders in quality training, provided services, and student certification passes, its goal is to c...
SYS-CON Events announced today that Hitachi Data Systems, a wholly owned subsidiary of Hitachi LTD., will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City. Hitachi Data Systems (HDS) will be featuring the Hitachi Content Platform (HCP) portfolio. This is the industry’s only offering that allows organizations to bring together object storage, file sync and share, cloud storage gateways, and sophisticated search and...
Internet of @ThingsExpo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 21st International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. @ThingsExpo Silicon Valley Call for Papers is now open.
SYS-CON Events announced today that DivvyCloud will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. DivvyCloud software enables organizations to achieve their cloud computing goals by simplifying and automating security, compliance and cost optimization of public and private cloud infrastructure. Using DivvyCloud, customers can leverage programmatic Bots to identify and remediate common cloud problems in rea...
SYS-CON Events announced today that Tintri, Inc, a leading provider of enterprise cloud infrastructure, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Tintri offers an enterprise cloud platform built with public cloud-like web services and RESTful APIs. Organizations use Tintri all-flash storage with scale-out and automation as a foundation for their own clouds – to build agile development environments...