Apache Hadoop: Technical Debt Decreased by 14% Through Code Refactoring

The project's initial technical debt was reduced from 136 to 117 days of remediation

Measuring technical debt is worth nothing unless pragmatic action is taken in the code to control and reduce it. To illustrate Scertify's ability to automatically correct the code defects that increase this unintended debt, we refactored two subprojects of the Hadoop project: Hadoop Common and Hadoop MapReduce. With Scertify, we corrected 25K defects in 2 minutes. In other words, 14% of the technical debt was written off without any human effort.

Initial analysis

According to Wikipedia, Apache Hadoop is "an open-source software framework that supports data-intensive distributed applications". The framework contains several projects; Common and MapReduce are two important ones, with 120K and 162K lines of code respectively (blank lines and comments excluded). The version we worked with is the latest development version: 3.0.0-SNAPSHOT.

We ran Scertify Refactoring Assessment, our open-source plugin for Sonar, on both projects to get an overview of their technical debt. Technical debt is defined here as the amount of time needed to correct all detected defects. As the screenshots below show, Common has a technical debt of 70 days and MapReduce of 66 days. Scertify Refactoring Assessment also computes how much of the technical debt can be corrected automatically: the debt write-off. Both projects have good potential for automatic refactoring, 38 and 36 days respectively. The next step is therefore to use Scertify to perform this automatic refactoring. By the way, if you would like to try it with your own source code, a trial version of Scertify is available here.

Hadoop Common original technical debt

Hadoop MapReduce original technical debt

We browsed through the reported violations and chose 8 rules for the demonstration.

Refactoring rules for the demonstration

Here's a presentation of the refactoring rules we used in this demonstration. As you can see, some rules need parameters to be effective. This is the case for the rules regarding logging. The logging framework used in these projects is Apache Commons Logging, so we configured the rules to use this framework.

AvoidPrintStackTrace

This rule reports a violation when it finds code that catches an exception and prints its stack trace to the standard error output. A logging framework should be used instead, in order to improve the application's maintainability. The refactoring replaces the call to printStackTrace with a call to the logging framework. The rule can also declare the logger in the class and add the required imports. Here's an example of the original code and the refactored code in the class GenericWritable.

Original code:

catch (Exception e) {
      e.printStackTrace();
      throw new IOException("Cannot initialize the class: " + clazz);
}

Refactored code:

catch (final Exception e) {
      LOG.error(e.getMessage(), e);
      throw new IOException("Cannot initialize the class: " + clazz);
}

In this case, LOG was not declared, so it was added to the class and the required imports were added:

private static final Log LOG = LogFactory.getLog(GenericWritable.class);

InefficientConstructorCall

Calling the constructor of a wrapper type such as Integer to convert a primitive value is bad practice: it is less efficient than calling the static method valueOf, which can reuse cached instances.
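As a minimal sketch of what this rule targets (the class and method names below are hypothetical and not taken from the Hadoop sources), the flagged constructor call is shown as a comment and the valueOf form is what runs:

```java
class WrapperConversion {
    // Flagged form, kept as a comment because the constructor is
    // deprecated in recent JDKs:
    //     Integer boxed = new Integer(value);

    // Refactored form: valueOf may return a cached instance instead of
    // allocating a new object on every call.
    static Integer box(int value) {
        return Integer.valueOf(value);
    }

    public static void main(String[] args) {
        Integer a = box(42);
        Integer b = box(42);
        // Values from -128 to 127 are guaranteed to come from the cache,
        // so both references point to the same object here.
        System.out.println(a == b);
    }
}
```

The reference comparison in main is only there to make the caching visible; regular code should still compare wrapper values with equals.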

PositionLiteralsFirstInComparisonsRefactor

This rule checks that literals are in the first position in comparisons. The refactoring swaps the literal and the variable. This ensures that the comparison cannot throw a NullPointerException when the variable is null.
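The effect is easiest to see with String.equals. The following is an illustrative sketch with hypothetical names, not code from Hadoop or output from Scertify:

```java
class LiteralFirstComparison {
    // Flagged form: calling equals on the variable fails when it is null.
    static boolean isCommonFlagged(String name) {
        return name.equals("common");
    }

    // Refactored form: the literal is never null, so a null argument
    // simply makes the comparison return false.
    static boolean isCommon(String name) {
        return "common".equals(name);
    }

    public static void main(String[] args) {
        System.out.println(isCommon("common")); // true
        System.out.println(isCommon(null));     // false
        try {
            isCommonFlagged(null);
        } catch (NullPointerException e) {
            System.out.println("flagged form threw NullPointerException");
        }
    }
}
```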

AddEmptyStringToConvert

Concatenating an empty string to convert a primitive value to a String is bad practice. First of all, it makes the code less readable. It is also less efficient in most cases (the only case where string concatenation is slightly better is when the primitive is final). Here's an example taken from the class MD5MD5CRC32FileChecksum.

Original code:

xml.attribute("bytesPerCRC", "" + that.bytesPerCRC);

Refactored code:

xml.attribute("bytesPerCRC", String.valueOf(that.bytesPerCRC));

GuardDebugLogging

When a String concatenation is performed inside a debug log call, one should check whether debug is enabled before making the call. Otherwise, the concatenation is always performed, even when the message is never logged. The refactoring adds a guard before the call to debug. In this case, it is configured to use the method isDebugEnabled(), since we use Apache Commons Logging. Below is an example of refactored code taken from the class ActiveStandByElector:

if(LOG.isDebugEnabled()){
        LOG.debug("StatNode result: " + rc + " for path: " + path + " connectionState: " + zkConnectionState + " for " + this);
}

IfElseStmtsMustUseBraces

This rule finds if statements that don't use braces. The refactoring adds required braces.
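A before-and-after sketch of this transformation, using a hypothetical method rather than code from Hadoop:

```java
class BracesExample {
    // Flagged form: neither branch is enclosed in braces.
    static String signFlagged(int n) {
        if (n < 0)
            return "negative";
        else
            return "non-negative";
    }

    // Refactored form: same behavior, with braces around each branch.
    static String sign(int n) {
        if (n < 0) {
            return "negative";
        } else {
            return "non-negative";
        }
    }

    public static void main(String[] args) {
        System.out.println(sign(-3));
        System.out.println(sign(5));
    }
}
```

The behavior is unchanged; the braces mainly protect against a later edit that adds a second statement to a branch without noticing it falls outside the if.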

UseCollectionIsEmpty

This rule finds uses of Collection's size() method to check whether a collection is empty. Rather than using size(), it is better to use isEmpty(), which makes the code easier to read. The refactoring replaces comparisons between size() and 0 with a call to isEmpty().
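For illustration, here is a small sketch of the two forms side by side (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

class IsEmptyExample {
    // Flagged form: compares size() with 0 to test for emptiness.
    static boolean hasNoItemsFlagged(List<String> items) {
        return items.size() == 0;
    }

    // Refactored form: isEmpty() states the intent directly.
    static boolean hasNoItems(List<String> items) {
        return items.isEmpty();
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        System.out.println(hasNoItems(items)); // true
        items.add("job-1");
        System.out.println(hasNoItems(items)); // false
    }
}
```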

LocalVariableCouldBeFinal

This rule flags local variables that could be declared final but are not. The final keyword is useful information for future readers of the code. The refactoring adds the "final" keyword. This is not a critical rule, but since it has a huge number of violations, it is useful to get rid of them quickly with automatic refactoring.
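A short sketch of the kind of change involved, with hypothetical names that do not come from the Hadoop code base:

```java
class FinalLocalExample {
    // Flagged form: result is assigned once and never reassigned,
    // so it could be declared final.
    static int rectangleAreaFlagged(int width, int height) {
        int result = width * height;
        return result;
    }

    // Refactored form: final documents that result never changes, and the
    // compiler rejects any later reassignment.
    static int rectangleArea(int width, int height) {
        final int result = width * height;
        return result;
    }

    public static void main(String[] args) {
        System.out.println(rectangleArea(3, 4)); // 12
    }
}
```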

Scertify's refactoring results

We ran Scertify on both projects to detect and refactor violations of those rules. On each project, the full process took around 1 minute. Scertify generates an HTML report with information on the errors detected and corrected. Below is a summary of all errors corrected in the two projects. Overall, it took 2 minutes to correct 25392 defects. Not bad, is it? Those defects include both minor violations and more critical ones in terms of maintainability, performance or robustness.

Violations refactored

As the screenshots below show, with those defects corrected the technical debt of each project was reduced by about 10 days. Overall, that is roughly 20 days of technical debt written off.

Refactored Common technical debt

Refactored MapReduce technical debt

Last but not least, Hadoop contains many unit tests, and of course we made sure that they still pass after the refactoring. To conclude, thanks to Scertify's refactoring features we were able to efficiently correct 25K defects in a few minutes. We are glad to make the refactored code available to the community; you can download it below. We will continue to do such refactoring on open-source applications, so if you have an idea for an open-source project that could benefit from it, just let us know!

Download the source files

More Stories By Michael Muller

Michael Muller, a Marketing Manager at Tocea, has 10+ years of experience as a Marketing and Communication Manager. He specializes in technology and innovative companies. He is executive editor at http://dsisionnel.com, a French IT magazine and the creator of http://d8p.it, a cool URL shortener. Dad of two kids.
