| By Damon Young | Article Rating: |
|
| January 15, 2013 12:00 AM EST | Reads: |
387 |
At ProjectLocker, we operate a polyglot environment with a heavy Ruby bias. While we love Ruby and Rails, one of the drawbacks of Ruby is its Global VM Lock. In a nutshell, the Global VM Lock makes it harder to write Ruby code that can fully utilize a modern multi-core server. For Web applications, this isn’t a problem because the web server manages multiple processes for you (e.g. via Passenger). However, for offline processes, parallelism doesn’t come for free.
I was recently working on a project that involved the offline batch processing of lots of data. This project has been operating successfully for some time, but the data set has grown, causing the process to need more more time to complete than we’d like. So I dove in to see what we could do to speed it up. Fortunately, the process was still single-threaded, so we knew we’d be able to inject concurrency to increase throughput without adding hardware.
The job in question runs on a fairly well-equipped server, but the server was underutilized due to the process being serial. Here’s an outline of the initial code:
def main_job
for retrieve_giant_dataset().each do |item|
long_process(item)
end
summarize_results(retrieve_all_results())
end
def long_process(item)
# Do some work on item that uses a lot of CPU time.
item.save
end
That approach gets the job done, but I wanted to parallelize it. Conceptually, I wanted to transform the main_job method so that it looked something like this:
def main_job
threads = []
for retrieve_giant_dataset().each do |item|
threads << Thread.new(item) do
long_process(item)
end
end
threads.each { |t| t.join }
summarize_results(retrieve_all_results())
end
Unfortunately, it’s not that easy due to the aforementioned Global VM Lock. What I needed was a way to get my threads running on a bunch of independent processes. This is a problem tailor-made for a job queueing system. Enter beanstalkd, a simple & fast work queue. We paired beanstalkd with Stalker, a DSL that makes it easy to queue and process jobs from Ruby. Integrating these two was a cinch. Here’s what the restructured code looks like now:
def main_job
for retrieve_giant_dataset().each do |item|
Stalker.enqueue(JOB_NAME, :id => item.id)
end
beaneater = Beaneater::Pool.new(['localhost:11300'])
tube = beaneater.tubes.find TUBE_NAME
while tube.peek(:ready)
sleep(5)
end
summarize_results(retrieve_all_results())
end
So instead of processing each item during the loop, now we just add each to the beanstalkd queue. Once we finish queueing all of the items, we wait until all of our entries have been processed by the worker processes. The workers are initiated via a jobs.rb file that looks something like this:
include Stalker
job JOB_NAME do |args|
item = ItemClass.find(args['id'])
Worker::long_process(item)
end
We then start beanstalkd and a few worker processes and we’re off to the races. Now our job runs in parallel via multiple processes, and we can tune the number of worker processes we run to consume as much of the machine’s resources as we like. As a bonus, we can also run Stalker workers on other machines in our cluster for added parallelism. With just a few minor tweaks to our code, we’ve gone from single-threaded to a solution that is limited only by the capacity of the shared database used. Sweet!
What about the MapReduce reference in the title of this post? The MapReduce algorithm basically has two steps. In the Map step, you divide the work and assign it to worker nodes. The Reduce step simply combines the results of each individual node’s computation into an aggregate result. In our solution here, the Map step is done by us enqueuing our jobs into beanstalkd and then beanstalkd making the jobs available for consumption by our nodes. Our database serves to communicate the details of the jobs, and stands in for a shared filesystem like the HDFS used by Hadoop. I didn’t go into detail about this step, but our Reduce is also assisted by database aggregates; we’re able to construct a few simple queries that get us what we want from the database.
So there it is, distributed MapReduce for Ruby using beanstalkd, Stalker, and a healthy database. This is probably not the best solution if you need to scale to thousands or tens of thousands of workers. But if you just need to get tens of workers running in parallel quickly, you may be able to adapt this approach to fit your needs.
Read the original blog entry...
Published January 15, 2013 Reads 387
Copyright © 2013 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Damon Young
Damon Young is Director of Sales at ProjectLocker.com. ProjectLocker was founded in 2003 to provide on-demand tools for software developers. Guided by the simple mission of helping companies build better software, ProjectLocker's services have expanded to include services for the complete lifecycle of software projects, from requirements documentation to build and test automation. ProjectLocker serves companies from startups to Fortune 1000 multinationals.
- Cloud People: A Who's Who of Cloud Computing
- Windows Azure IaaS Reaches General Availability
- Predixion Software Announces General Availability of the Latest Version of its Predictive Analytics Platform
- Cloud Expo New York: The Big Challenge of Big Data & Hadoop Integration
- Agile Solutions for Cloud, Big Data, Mobility Services
- MicroStrategy Announces General Availability of MicroStrategy 9.3.1
- Cloud Computing: Cutting Costs, Boosting Profits
- AMAX Launches StorMax(TM) CFS, powered by IBM(R) General Parallel File System(TM) (GPFS(TM))
- NIST to Sponsor FFRDC Widespread Adoption of Integrated CyberSecurity
- Big Data: Visualizing the Strategic Business Imperative
- MicroStrategy Announces General Availability of MicroStrategy 9.3.1
- Benefits of Cloud Computing
- Cloud People: A Who's Who of Cloud Computing
- Windows Azure IaaS Reaches General Availability
- Portable Experimenter’s Platform, Powered by Raspberry Pi
- Predixion Software Announces General Availability of the Latest Version of its Predictive Analytics Platform
- SUSE Receives Common Criteria Security Certifications
- Basho Announces Open Source Riak CS and General Availability of Riak CS Enterprise v1.3
- Cloud Expo New York: Big Time - Introducing Hadoop on Azure
- Cloud Expo New York: Real-Time Analytics Using an In-Memory Data Grid
- Cloud Expo New York: The Big Challenge of Big Data & Hadoop Integration
- Help Desk Solution Empowers Employees
- Public Cloud’s Got a Silver Lining: Gartner
- Granular Enforcement of Access to File Systems Featured in Latest Release of FoxT ServerControl
- The Top 250 Players in the Cloud Computing Ecosystem
- Web Services Using ColdFusion and Apache CXF
- Cloud People: A Who's Who of Cloud Computing
- Red Hat Named "Platinum Sponsor" of Virtualization Conference & Expo
- Cloud Expo New York Call for Papers Now Open
- Eclipse "Pollinate" Project to Integrate with Apache Beehive
- An Introduction to Ant
- Cloud Expo 2011 East To Attract 10,000 Delegates and 200 Exhibitors
- Beehive Code Now Available in Apache
- 4th International Cloud Computing Conference & Expo Starts Today
- Apache's Tomcat 5.5 is First Release Ever to Use Eclipse JDT Java Compiler
- "Beehive" Now Officially an Open Source Project: Apache Beehive


























