How I plugged a memory leak in my perfect Java server
tl;dr: My stable server hit OOM, and after some confusion I was able to locate the cause in a Redshift JDBC driver setting by analyzing heap dumps.
For years people have criticized JVM languages for being memory inefficient, but few realize how much the JVM has grown. Most of that bad reputation comes from the following facts, which no longer hold:
- Java was one of the early languages to take responsibility for memory management away from the programmer. For a language that supports dynamic allocation on the heap this was not an easy task, and the early garbage collection algorithms were quite primitive. Much of this has changed: Java 9 now ships a variety of sophisticated GC algorithms (Parallel, Concurrent Mark Sweep, G1 and so on) that exploit multiprocessor architectures and smarter strategies (avoiding stop-the-world pauses, treating short-lived and long-lived objects differently) to make collection more efficient and effective.
- In its early days Java was hugely popular for writing web servers and applications. Those were the times of single processors and primitive web libraries; the whole environment was inefficient at handling scale and would easily fill up memory as the number of connections grew.
- Java had (and still has) some poorly written libraries (applets, Spring, Hibernate?) that proved bulky and earned the language a bad name.
- Bad developers: it is easy to write code in Java (in fact, in any garbage-collected language) that can really kill the system. Common mistakes include (i) using boxed objects instead of primitives, (ii) creating tons of strings by using '+' for concatenation, (iii) not closing resources, and (iv) creating objects faster than they can be processed in a multithreaded system. Two of these are illustrated in the short sketch after this list.
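To make (ii) and (iii) concrete, here is a minimal sketch of the usual fixes. The connection, table name and row handling are placeholders for illustration, not code from the server in this post.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;

class CommonMistakesFixed {

    // (ii) Build big strings with a StringBuilder; '+' inside a loop
    // allocates a brand new String on every iteration.
    static String joinRows(List<String> rows) {
        StringBuilder sb = new StringBuilder();
        for (String row : rows) {
            sb.append(row).append('\n');
        }
        return sb.toString();
    }

    // (iii) try-with-resources closes the Statement and ResultSet on every
    // code path, including the ones that throw an exception mid-processing.
    static void readIds(Connection connection) throws Exception {
        try (Statement stmt = connection.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id FROM some_table")) { // placeholder query
            while (rs.next()) {
                System.out.println(rs.getLong("id"));
            }
        }
    }
}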
Long story short, modern Java is fast and efficient, and an experienced developer has nothing to fear about Java taking too much memory. There is an excellent post on the subject here: is-java-really-slow
So I followed the best practices, handled exceptions, closed my resources and used the best tuning the JVM has to offer, to create a perfect server. So perfect that I never had to look at it for 5 months. No failures, no out-of-memory (OOM) errors, no deadlocks. All perfect, until it happened:
Phase #1: Denial
The exception happened right after the update we did for the Spectre/Meltdown fix. My first reaction was to blame it all on the fix, but the fix had nothing to do with memory usage; its only known side effect was increased CPU usage, which was expected given the nature of the vulnerability.
Denial is the most useless, yet most common, response from a developer faced with a bug in the system.
The fix had caused the EC2 instance to become unavailable (still waiting for Amazon to confirm), so I had to set up the server again. Next I blamed wrong JVM parameters: I had optimized my beloved server with some JVM flags, so I figured I might have made a mistake while setting the flags again. After spending some quality time reading and researching the perfect flags, I decided to remove all the optimizations. Still broken!
Phase #2: Anger/Confusion
How could this be possible? The same code had been stable for 5 months and now it was leaking memory. I just couldn't digest it. The timing of the leak aligned perfectly with the system update, so I was convinced that something was wrong with the system configuration, leading to (i) an exception, (ii) a resource leak or (iii) a deadlock, and from there to the memory leak. Next I spent countless iterations browsing the code for a possible resource leak, but which resource was it?
Phase #3: The Action
After wasting time thinking, griping and browsing code, I finally went to the logs, something I should have done a lot earlier. Java provides a series of useful utilities to analyze the state of a running program.
jstack: Thread stack dumps
My first move was to check for a deadlock between the threads. The server is fairly complex, with multiple thread pools sharing resources; a deadlock is still possible even though I could not find any such scenario in code walk-throughs. jstack is a tool shipped with the JDK that prints the stack trace of every thread in the JVM, which makes it perfect for finding (i) deadlocks, (ii) starvation and (iii) any other anomaly in the thread pools (too many threads?).
jstack 26293 > stack
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
— locked <0x00000005495c0208> (a sun.nio.ch.Util$3)
— locked <0x00000005495c01f8> (a java.util.Collections$UnmodifiableSet)
— locked <0x00000005495c0218> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at com.amazon.jdbc.communications.InboundMessagesThread.run(Unknown Source)
jstack didn't detect any deadlocks, and the numbers of threads in the WAITING and RUNNABLE states were exactly what I expected.
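As a side note, jstack works from outside the process; if you ever want the same check from inside the server (say, from a watchdog thread), the JDK's ThreadMXBean exposes the deadlock detector programmatically. A minimal sketch, not part of the original server:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class DeadlockWatchdog {
    static void logDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = threads.findDeadlockedThreads(); // null when no deadlock exists
        if (deadlocked == null) {
            return;
        }
        for (ThreadInfo info : threads.getThreadInfo(deadlocked)) {
            System.err.println(info.getThreadName() + " is waiting on " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
    }
}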
jmap: let's count the objects
jmap is another JDK tool for troubleshooting JVM issues. It essentially prints JVM memory statistics and can also produce a snapshot of the complete heap.
In almost all cases, an OOM happens because too many live objects accumulate on the heap.
One of the best strategies for locating a memory leak is the -histo option of jmap. It prints an approximate count of the objects of each class, along with the approximate total size taken by all their instances:
- Get a histogram of the objects.
- Force a full GC with the -histo:live option and take the histogram again.
- Compare the two histograms and see which objects were not garbage collected. These are the objects to which Java is still holding references.
- Repeat the above steps at a few intervals. After some iterations you will know how many objects of each class survive repeated GCs over a long period. These are the long-lived objects; check against your logic whether they should be living that long.
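In practice the compare step boils down to a couple of commands. The snapshot file names below are just placeholders; 26293 is my server's process id:

jmap -histo:live 26293 > histo-1
# ...let the server run for a while, then take a second live histogram...
jmap -histo:live 26293 > histo-2
diff histo-1 histo-2
# classes whose instance counts keep growing across snapshots are the suspects

Here is what the first live histogram looked like on my server: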
jmap -histo:live 26293 > old
num #instances #bytes class name
------------------------------------------
1: 50043884 2804420624 [B
2: 50038290 2401837920 com.amazon.redshift.client.messages.inbound.DataRow
3: 50043663 1471389608 [Ljava.lang.Object;
4: 5130370 451698240 [C
5: 10187144 244661328 [I
6: 5129973 123119352 java.lang.String
7: 75753 2424096 java.util.concurrent.FutureTask
8: 75921 1822104 java.util.concurrent.ConcurrentLinkedQueue$Node
9: 20025 640800 java.util.HashMap$Node
10: 3331 380992 java.lang.Class
11: 11461 366752 java.util.concurrent.ConcurrentHashMap$Node
12: 1283 166544 [Ljava.util.HashMap$Node;
13: 1881 105336 com.mangat.example.DBRefresher$User
In the histogram above, I was shocked to see 50 million objects of the Redshift DataRow type. This was certainly not what I had expected.
Sometimes the root cause is not the one event that is visible, but a silent bomb in the code, waiting to explode.
Phase #4: The Root Cause
Now clearly something was wrong with the Redshift client. Redshift is Amazon's popular SQL-based big data database. My server constantly queries it and retrieves new rows, so I arrived at the following hypotheses:
i) Some exception is leading to a code path where the ResultSet is not closed, so all the objects created by the Redshift client are still referenced by the umbrella ResultSet object.
I looked into the code again, this time knowing which resource was leaking. I walked dozens of times through every code path from the query to the processing of the ResultSet. It is always closed.
ii) What if the ResultSet is actually closed, but the Redshift library is not closing it cleanly?
I checked all the known issues and updated the library; nothing mentioned Redshift failing to close a ResultSet cleanly. Besides, how had it been stable for months?
iii) What if the Redshift driver loads the whole result set into memory?
That sounded like the only option remaining. I checked all the queries we were running and then, Eureka: somebody had written an SQL query without 'DISTINCT', which means we were actually retrieving over 50 million rows from the DB. That number matched the number of DataRow objects in the histogram.
Quick Fix: Simply use ResultSet.setFetchSize(n);
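In plain JDBC terms the change is tiny. A minimal sketch, where the query and row handling are placeholders; here I give the hint on the Statement before executing, which sets the default fetch size for the ResultSets it produces:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

class FetchSizedQuery {
    static void streamRows(Connection connection, int fetchSize) throws Exception {
        try (Statement stmt = connection.createStatement()) {
            stmt.setFetchSize(fetchSize); // hint: pull rows from Redshift in batches of this size
            try (ResultSet rs = stmt.executeQuery("SELECT id FROM big_table")) { // placeholder query
                while (rs.next()) {
                    // handle one row here; only about one batch of DataRow objects is live at a time
                }
            }
        }
    }
}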
I took n to be 10000 and deployed the code again:
jmap -histo 22246 > new
2: 6443 309264 com.amazon.redshift.client.messages.inbound.DataRow
3: 53506 1284144 java.lang.String
4: 29594 947008 java.util.HashMap$Node
5: 8832 685560 [Ljava.lang.Object;
6: 3334 381304 java.lang.Class
7: 11371 363872 java.util.concurrent.ConcurrentHashMap$Node
8: 5814 325584 main.java.labs.DBRefresher$Partner
9: 2701 265360 [I
10: 1333 225664 [Ljava.util.HashMap$Node;
Voila! We have only 6443 DataRow objects, and the server has been stable for several hours now. So the code had none of the common memory leak issues; it was a wrong query that exposed the actual problem, the missing fetch-size setting for the Redshift driver.
I also later found a post from Amazon about this very issue: https://docs.aws.amazon.com/redshift/latest/dg/queries-troubleshooting.html#set-the-JDBC-fetch-size-parameter
Any time-tested code can have a code path that was never exercised in production or QA.
My takeaways?
1. Eliminate, or at least shorten, the denial phase. There is a reason for everything, and very often it is the code.
2. Focus on the fundamentals rather than jumping the gun. I should have tried jmap sooner.
3. Read as much as you can about the technology you use, at least all of the official docs. Don't be overconfident in your own wisdom.
4. This issue doesn't make me less of an engineer, but a better one :)