Welcome!

Apache Authors: Pat Romanski, Liz McMillan, Elizabeth White, Janakiram MSV, Gil Allouche

Related Topics: Java IoT, Industrial IoT, Microservices Expo, IoT User Interface, Apache, @BigDataExpo

Java IoT: Article

Fix Memory Leaks in Java Production Applications

A memory diagnostics approach for production identifies and fixes the root cause of the problem

Adding more memory to your JVMs (Java Virtual Machines) might be a temporary solution to fixing memory leaks in Java applications, but it for sure won't fix the root cause of the issue. Instead of crashing once per day it may just crash every other day. "Preventive" restarts are also just another desperate measure to minimize downtime, but, let's be frank: this is not how production issues should be solved.

One of our customers - a large online retail store - ran into such an issue. They run one of their online gift card self-service interfaces on two JVMs. During peak holiday seasons when users are activating their gift cards or checking the balance, crashes due to OOM (Out Of Memory) were more frequent, which caused bad user experience. The first "measure" they took was to double the JVM Heap Size. This didn't solve the problem as JVMs were still crashing, so they followed the memory diagnostics approach for production as explained in Java Memory Leaks to identify and fix the root cause of the problem.

Before we walk through the individual steps, let's look at the memory graph that shows the problems they had in December during the peak of the holiday season. The problem persisted even after increasing the memory. They could fix the problem after identifying the real root cause and applying specific configuration changes to a third-party software component.

After identifying the actual root cause and applying necessary configuration changes did the memory leak issue go away? Increasing Memory was not even a temporary solution that worked.

Step 1: Identify a Java Memory Leak
The first step is to monitor the JVM/CLR Memory Metrics such as Heap Space. This will tell us whether there is a potential memory leak. In this case we see memory usage constantly growing, resulting in an eventual runtime crash when the memory limit is reached.

Java Heap Size of both JVMs showed significant growth starting Dec 2nd and Dec 4th resulting in a crash on Dec 6th for both JVMs when the 512MB Max Heap Size was exceeded.

Step 2: Identify problematic Java Objects
The out-of-memory exception automatically triggers a full memory dump that allows for an analysis of which objects consumed the heap and are most likely to be the root cause of the out-of-memory crash. Looking at the objects that consumed most of the heap below indicates that they are related to a third-party logging API used by the application.

Sorting by GC (Garbage Collection) Size and focusing on custom classes (instead of system classes) shows that 80% of the heap is consumed by classes of a third-party logging framework

A closer look at an instance of the VPReportEntry4 shows that it contains five strings - with one consuming 23KB (as compared to several bytes of other string objects).This also explains the high GC Size of the String class in the overall Heap Dump.

Individual very large String objects as part of the ReportEntry object

Following the referrer chain further up reveals the complete picture. The EventQueue keeps LogEvents in an Array, which keeps VPReportEntrys in an Array. All of these objects seem to be kept in memory as the objects are being added to these arrays but never removed and therefore not garbage collected:

Following the referrer tree reveals that global EventQueue objects hold on to the LogEvent and VPReportEntry objects in array lists which are never removed from these arrays

Step 3: Who allocates these objects?
Analyzing object allocation allows us to figure out which part of the code is creating these objects and adding them to the queue. Creating what is called a "Selective Memory Dump" when the application reached 75% Heap Utilization showed the customer that the ReportWriter.report method allocated these entries and that they have been "living" on the heap for quite a while.

It is the report method that allocates the VPReportEntry objects that stay on the heap for quite a while

Step 4: Why are these objects not removed from the Heap?
The premise of the third-party logging framework is that log entries will be created by the application and written in batches at certain times by sending these log entries to a remote logging service using JMS. The memory behavior indicates that even though these log entries might be sent to the service, these objects are not always removed from the EventQueue leading to the out-of-memory exception.

Further analysis revealed that the background batch writer thread calls a logBatch method, which loops through the event queue (calling EventQueue.next) to send current log events in the queue. The question is whether as many messages were taken out of the queue (using next) vs put into the queue (using add) and whether the batch job is really called frequently enough to keep up with the incoming event entries. The following chart shows the method executions of add, as well as the call to logBatch highlighting that logBatch is actually not called frequently enough and therefore not calling next to remove messages from the queue:

The highlighted area shows that messages are put into the queue but not taken out because the background batch job is not executed. Once this leads to an OOM and the system restarts it goes back to normal operation but older log messages will be lost.

Step 5: Fixing the Java Memory Leak problem
After providing this information to the third-party provider and discussing with them the number of log entries and their system environment the conclusion was that our customer used a special logging mode that was not supposed to be used in high-load production environments. It's like running with DEBUG log level in a high load or production environment. This overwhelmed the remote logging service and this is why the batch logging thread was stopped and log events remained in the EventQueue until the out of memory occurred.

After making the recommended changes the system could again run with the previous heap memory size without experiencing any out-of-memory exceptions.

The Memory Leak issue has been solved and the application now runs even with the initial 512MB Heap Space without any problem.

They still use the same dashboards they have built to troubleshoot this issue, and to monitor for any future excessive logging problems.

These dashboards allow them to verify that the logging framework can keep up with log messages after they applied the changes.

Conclusion
Adding additional memory to crashing JVMs is most often not a temporary fix. If you have a real Java memory leak it will just take longer until the Java runtime crashes. It will even incur more overhead due to garbage collection when using larger heaps. The real answer to this is to use the simple approach explained here. Look at the memory metrics to identify whether you have a leak or not. Then identify which objects are causing the issue and why they are not collected by the GC. Working with engineers or third-party providers (as in this case) will help you find a permanent solution that allows you to run the system without impacting end users and without additional resource requirements.

Next Steps
If you want to learn more about Java Memory Management or general Application Performance Best Practices check out our free online Java Enterprise Performance Book. Existing customers of our APM Solution may also want to check out additional best practices on our APM Community.

More Stories By Andreas Grabner

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@ThingsExpo Stories
Developing software for the Internet of Things (IoT) comes with its own set of challenges. Security, privacy, and unified standards are a few key issues. In addition, each IoT product is comprised of (at least) three separate application components: the software embedded in the device, the back-end service, and the mobile application for the end user’s controls. Each component is developed by a different team, using different technologies and practices, and deployed to a different stack/target –...
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 19th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devices - comp...
SYS-CON Events announced today that Numerex Corp, a leading provider of managed enterprise solutions enabling the Internet of Things (IoT), will exhibit at the 19th International Cloud Expo | @ThingsExpo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Numerex Corp. (NASDAQ:NMRX) is a leading provider of managed enterprise solutions enabling the Internet of Things (IoT). The Company's solutions produce new revenue streams or create operating...
As cloud adoption continues to transform business, today’s global enterprises are challenged with managing a growing amount of information living outside of the data center. The rapid adoption of IoT and increasingly mobile workforce are exacerbating the problem. Ensuring secure data sharing and efficient backup poses capacity and bandwidth considerations as well as policy and regulatory compliance issues.
Why do your mobile transformations need to happen today? Mobile is the strategy that enterprise transformation centers on to drive customer engagement. In his general session at @ThingsExpo, Roger Woods, Director, Mobile Product & Strategy – Adobe Marketing Cloud, covered key IoT and mobile trends that are forcing mobile transformation, key components of a solid mobile strategy and explored how brands are effectively driving mobile change throughout the enterprise.
Data is the fuel that drives the machine learning algorithmic engines and ultimately provides the business value. In his session at Cloud Expo, Ed Featherston, a director and senior enterprise architect at Collaborative Consulting, will discuss the key considerations around quality, volume, timeliness, and pedigree that must be dealt with in order to properly fuel that engine.
The Internet of Things can drive efficiency for airlines and airports. In their session at @ThingsExpo, Shyam Varan Nath, Principal Architect with GE, and Sudip Majumder, senior director of development at Oracle, will discuss the technical details of the connected airline baggage and related social media solutions. These IoT applications will enhance travelers' journey experience and drive efficiency for the airlines and the airports. The session will include a working demo and a technical d...
19th Cloud Expo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Meanwhile, 94% of enterpri...
Although it has gained significant traction in the consumer space, IoT is still in the early stages of adoption in enterprises environments. However, many companies are working on initiatives like Industry 4.0 that includes IoT as one of the key disruptive technologies expected to reshape businesses of tomorrow. The key challenges will be availability, robustness and reliability of networks that connect devices in a business environment. Software Defined Wide Area Network (SD-WAN) is expected to...
Data is an unusual currency; it is not restricted by the same transactional limitations as money or people. In fact, the more that you leverage your data across multiple business use cases, the more valuable it becomes to the organization. And the same can be said about the organization’s analytics. In his session at 19th Cloud Expo, Bill Schmarzo, CTO for the Big Data Practice at EMC, will introduce a methodology for capturing, enriching and sharing data (and analytics) across the organizati...
SYS-CON Events announced today that Pulzze Systems will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Pulzze Systems, Inc. provides infrastructure products for the Internet of Things to enable any connected device and system to carry out matched operations without programming. For more information, visit http://www.pulzzesystems.com.
SYS-CON Events announced today Telecom Reseller has been named “Media Sponsor” of SYS-CON's 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Telecom Reseller reports on Unified Communications, UCaaS, BPaaS for enterprise and SMBs. They report extensively on both customer premises based solutions such as IP-PBX as well as cloud based and hosted platforms.
Almost two-thirds of companies either have or soon will have IoT as the backbone of their business in 2016. However, IoT is far more complex than most firms expected. How can you not get trapped in the pitfalls? In his session at @ThingsExpo, Tony Shan, a renowned visionary and thought leader, will introduce a holistic method of IoTification, which is the process of IoTifying the existing technology and business models to adopt and leverage IoT. He will drill down to the components in this fra...
Pulzze Systems was happy to participate in such a premier event and thankful to be receiving the winning investment and global network support from G-Startup Worldwide. It is an exciting time for Pulzze to showcase the effectiveness of innovative technologies and enable them to make the world smarter and better. The reputable contest is held to identify promising startups around the globe that are assured to change the world through their innovative products and disruptive technologies. There w...
There is growing need for data-driven applications and the need for digital platforms to build these apps. In his session at 19th Cloud Expo, Muddu Sudhakar, VP and GM of Security & IoT at Splunk, will cover different PaaS solutions and Big Data platforms that are available to build applications. In addition, AI and machine learning are creating new requirements that developers need in the building of next-gen apps. The next-generation digital platforms have some of the past platform needs a...
With so much going on in this space you could be forgiven for thinking you were always working with yesterday’s technologies. So much change, so quickly. What do you do if you have to build a solution from the ground up that is expected to live in the field for at least 5-10 years? This is the challenge we faced when we looked to refresh our existing 10-year-old custom hardware stack to measure the fullness of trash cans and compactors.
The emerging Internet of Everything creates tremendous new opportunities for customer engagement and business model innovation. However, enterprises must overcome a number of critical challenges to bring these new solutions to market. In his session at @ThingsExpo, Michael Martin, CTO/CIO at nfrastructure, outlined these key challenges and recommended approaches for overcoming them to achieve speed and agility in the design, development and implementation of Internet of Everything solutions wi...
Today we can collect lots and lots of performance data. We build beautiful dashboards and even have fancy query languages to access and transform the data. Still performance data is a secret language only a couple of people understand. The more business becomes digital the more stakeholders are interested in this data including how it relates to business. Some of these people have never used a monitoring tool before. They have a question on their mind like “How is my application doing” but no id...
Cloud computing is being adopted in one form or another by 94% of enterprises today. Tens of billions of new devices are being connected to The Internet of Things. And Big Data is driving this bus. An exponential increase is expected in the amount of information being processed, managed, analyzed, and acted upon by enterprise IT. This amazing is not part of some distant future - it is happening today. One report shows a 650% increase in enterprise data by 2020. Other estimates are even higher....
Smart Cities are here to stay, but for their promise to be delivered, the data they produce must not be put in new siloes. In his session at @ThingsExpo, Mathias Herberts, Co-founder and CTO of Cityzen Data, will deep dive into best practices that will ensure a successful smart city journey.