Why Contextual Data Locality Matters

Big Data is quickly overtaking SDN as a key phrase in today’s networking lingo. Overused as it may already be, it has a lot more meaning and definition than SDN. Big Data solutions are designed to work on lots of data, as the name suggests. Of course, such systems have been around forever: talk to any large bank, credit card company, airline or logistics company, and you will find applications that have been running on extremely large databases and data sets for a very long time. But this is the new Big Data, the one inspired by Hadoop, MapReduce and friends: high-performance compute clusters created specifically to analyze large amounts of data and reduce it to a form and quantity that human brains can use in decision making.

What makes today’s Big Data solutions different from more traditional large-database applications, beyond the sheer size of the datasets being analyzed, is the distributed nature of the analysis. Big Data solutions are designed to run across hundreds or even thousands of servers, each with multiple CPU cores to chew on the data. Traditional large-database applications tend to be more localized, with fewer applications and servers accessing the data, which allows for more tightly integrated custom solutions, the likes of which Oracle and friends are experts at.

Big Data Flashback

In the late 80s I started my career as a network engineer for a high energy physics research institute. Working closely with the folks at CERN in Geneva, these physicists were (at the time, and probably still are) masters of creating very large datasets. Every time an experiment was run, Tbytes of data (probably Pbytes by now) were generated by thousands of sensors along the tunnel or ring through which particles were passed to collide.

The Big Data solution at the time was primitive, but not all that different from today’s. The large datasets were manually broken into manageable pieces, something that would fit on a tape or disk. These pieces were then copied by hand onto a compute server or supercomputer, and the analysis application would churn through them to find specific data, correlate events and simply reduce the data to something smaller and meaningful. This would create a new dataset, which would be combined and chopped up again, and the process repeated until they arrived at data that humans could use to create new theories, or that provided a piece of proof for an existing theory.

During that first job, the IT group spent an enormous amount of time moving data around. A lot of it was manual: tapes and disks were constantly being copied onto the appropriate compute server. The data had to be local for there to be any chance of analyzing it. Between tapes, local disks and the network, the local disks were the only storage fast enough to give any hope of completing the data reductions. And even then it would not be unusual to have a rather powerful (for the time) Apollo workstation run for several weeks on a single data set.

Back to the here and now

Fast forward to now. The above description is really not that different from how Hadoop MapReduce works. Start with a big data set, chop it into pieces, replicate the data, and compute on each piece physically close to where it is stored. Then send the results to Reducers, combine them, and perhaps repeat the process to arrive at human-interpretable results.
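As a rough illustration of that flow, here is a minimal single-process Python sketch. It is not Hadoop's API: the function names, the chunk size and the sample event list are all hypothetical, and the "nodes" are simply list entries. It only shows the shape of the pattern: chop, compute on each piece, then combine the partial results.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Chop the full dataset into manageable pieces."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: count event types within one chunk.

    In a real cluster this would run on a node that already stores the chunk,
    so only the small partial result has to cross the network.
    """
    return Counter(chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def run_job(records, chunk_size):
    """Run the chop / map / reduce cycle once over the records."""
    chunks = split_into_chunks(records, chunk_size)
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(reduce_counts, partials, Counter())


# Hypothetical sample data: one detected particle type per record.
events = ["muon", "electron", "muon", "photon", "muon", "electron"]
totals = run_job(events, chunk_size=2)
```

The point of the sketch is that each map step sees only its own chunk, which is what makes it possible to run that step wherever the chunk happens to live.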

As fast as we believe the network is with 10GbE access ports, it is still commonly the most restrictive component in the trio of compute, distributed storage and network. Increases in compute power have far outpaced increases in network speed and even memory speed. We have many more cycles available to compute, but have not been able to get data into these CPUs at a matching rate. As a result, storage solutions are becoming increasingly distributed, closer to the compute power that needs it.

It’s a natural thought to have the data close to where it needs to be processed, close enough that the effort of retrieving it does not impact the overall completion of the task that uses that data. If I am writing a research paper that takes several hours to complete, I do not mind having to wait a second here or there for the right web sites to load. I would mind if I had to get into my car and drive to the library to look something up, drive back home to work on my paper, and keep doing that. The time and effort needed to get the data have to be negligible compared to the time and effort required to complete the task.
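To put rough numbers on that relationship, here is a small back-of-the-envelope sketch in Python. It assumes an ideal link running at full line rate with no protocol overhead or congestion, and the 1 TB dataset size and the two job durations are made-up illustration values, not measurements. The function names are hypothetical.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_s, latency_s=0.0):
    """Best-case time to move a dataset: one-way latency plus serialization time."""
    return latency_s + (size_bytes * 8) / bandwidth_bits_per_s


def fetch_overhead(size_bytes, bandwidth_bits_per_s, compute_s, latency_s=0.0):
    """Ratio of time spent fetching data to time spent computing on it.

    A value far below 1 means the fetch is negligible; a value near or above 1
    means moving the data dominates the task.
    """
    return transfer_seconds(size_bytes, bandwidth_bits_per_s, latency_s) / compute_s


ONE_TB = 10**12        # bytes (assumed dataset size)
TEN_GBE = 10 * 10**9   # bits per second (ideal 10GbE line rate)

move_time = transfer_seconds(ONE_TB, TEN_GBE)              # seconds to copy the dataset
long_job = fetch_overhead(ONE_TB, TEN_GBE, compute_s=10 * 3600)  # ten-hour analysis
short_job = fetch_overhead(ONE_TB, TEN_GBE, compute_s=10 * 60)   # ten-minute analysis
```

With these assumed numbers, moving 1 TB over a single 10GbE link takes 800 seconds at best. That is about 2% of a ten-hour job, which is the "wait a second for the web site" case, but it is longer than the computation itself for a ten-minute job, which is the "drive to the library" case.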

Locality and growth

This type of contextual locality is extremely hard to manage in a dynamic and growing environment. How do you make sure that the right data remains contextually close to where it is needed when servers and VMs may not be physically close? Servers for the same application or customer may not be in the same rack; they may not even be in the same pod or datacenter. Storage is relatively cheap, but replication for closeness can very quickly lead to data distribution complexity that is unmanageable in environments that are not running a single orchestrated Big Data solution.

To solve this problem you need help from your network. You need to be able to create locality on the fly. Things that are not physically close need to be made virtually close, but with the characteristics of physical locality. In network terms these are of course measured in the usual staples of latency and bandwidth. This is where you want to articulate relationships between the data and the applications that need that data, and create a virtual closeness that resembles the physical. This may mean dedicated paths through multiple switches to avoid the congestion that would dramatically impact latency. These same paths can provide direct physical connectivity between application and storage through dynamically engineered optical paths, or simply appropriate prioritization of traffic along the way, all without having to worry explicitly about where the application is or where the storage is.

Physics will always stand in the way of what we really want or need, but that does not mean we cannot use that same physics, with a bit of math, to create solutions that manage the complexity of creating dynamic locality. Locality is important. It is more pronounced in Big Data solutions, but even at a smaller scale it matters within the context of the compute effort on that data.

[Today's fun fact: Lake Superior is the world's largest lake. With that kind of naming accuracy we would like to hire the person that named the lake as our VP of Naming and Terminology]

The post Why Contextual Data Locality Matters appeared first on Plexxi.

More Stories By Michael Bushong

The best marketing efforts combine deep technology understanding with a highly approachable means of communicating. Plexxi's Vice President of Marketing Michael Bushong acquired these skills during 12 years at Juniper Networks, where he led product management, product strategy and product marketing organizations for Juniper's flagship operating system, Junos. Michael spent the last several years at Juniper leading their SDN efforts across both service provider and enterprise markets. Prior to Juniper, Michael spent time at database supplier Sybase, and ASIC design tool companies Synopsys and Magma Design Automation. Michael's undergraduate work at the University of California Berkeley in advanced fluid mechanics and heat transfer lends new meaning to the marketing phrase "This isn't rocket science."
