CMS Garbage Collection taking too long

Question

One of our customers runs into major performance problems which seem to be caused by long garbage collection runs and I don't know what to do about it...

Java version: 1.7.0_67 JVM args: -Xms10240m -Xmx16386m -XX:NewSize=2048m -XX:MaxPermSize=150m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails ...

Additional info

The application is our server backend for a web application. The server connects to a SQL database and multiple ERP systems. This data is then processed and sent out to the web clients. This customer in particular has around 50 users world wide that access data via this single server.

OS: Windows Server 2008 R2 Datacenter + SP1
CPU: Xeon E5-2660 (4 Cores + HT)
RAM: 32 GB

What I've tried so far

increased Xmx and Xms (read that Xms should be identical or at least near Xmx for best performance)
tinkered with size of Young Generation (Ratio 1, 2, 3)
increased PermGeneration
tried different GC variants

Here are some parts of the gc log. You can clearly see that gc duration increases over time and peaks at about !!!17 seconds!!!.

What could cause these problems? Are there any JVM args that sould be tinkered with?

Update / 25.03.2016

After getting VisualVM running it seems I found one of / the problem: Young Generation is now set to 4GB - the ratio between Eden and Survivor seems to be way off... Eden = 3,5 GB while Survivor = 0,5 GB As there are a lot of new objects generated all the time (a thing I'll have to investigate further), Eden gets populated with more then 0,5 GB (more like 1 to 2 GB). Survivor can't hold it and therefore a Major GC is triggered.

New questions arising:

Are my assumptiums wrong?
I think my next steps should be to manually set the survivor ratio. Any suggestions based on my case?
Do I need to disable AdaptiveSizePolicy? (-XX:-UseAdaptiveSizePolicy)
Why does the GC log show the cause as "System.gc()" while there is definitely no manual GC triggered in code (decompiled all classes and checked):

2016-03-25T04:59:55.726+0100: 14992.418: [Full GC (System.gc())2016-03-25T04:59:55.726+0100: 14992.418: [CMS: 265708K->265749K(6291456K), 1.1520720 secs] 507160K->265749K(10066368K), [CMS Perm : 54875K->54875K(91504K)], 1.1523255 secs] [Times: user=1.14 sys=0.00, real=1.14 secs]

Old Logs / 24.03.2016

First few logs after start:

12.434: [GC12.434: [ParNew: 1677824K->100069K(1887488K), 0.0983359 secs] 1677824K->100069K(10276096K), 0.0985257 secs] [Times: user=0.31 sys=0.02, real=0.11 secs] 
16.293: [GC16.293: [ParNew: 1777893K->209664K(1887488K), 0.1933143 secs] 1777893K->212290K(10276096K), 0.1934443 secs] [Times: user=0.50 sys=0.02, real=0.19 secs] 
19.994: [Full GC19.994: [CMS: 2626K->172662K(8388608K), 0.6312313 secs] 1521173K->172662K(10276096K), [CMS Perm : 24831K->24825K(24896K)], 0.6314259 secs] [Times: user=0.70 sys=0.06, real=0.64 secs] 
261.411: [Full GC261.411: [CMS: 172662K->27686K(8388608K), 0.2232379 secs] 876118K->27686K(10276096K), [CMS Perm : 30913K->30911K(41376K)], 0.2234464 secs] [Times: user=0.22 sys=0.00, real=0.22 secs] 
441.716: [Full GC441.716: [CMS: 27686K->24604K(8388608K), 0.1768511 secs] 500368K->24604K(10276096K), [CMS Perm : 30937K->30936K(51520K)], 0.1770344 secs] [Times: user=0.17 sys=0.00, real=0.17 secs]

Some time later:

20235.044: [Full GC20235.044: [CMS: 195484K->248016K(8388608K), 1.0221468 secs] 798601K->248016K(10276096K), [CMS Perm : 52564K->52564K(87480K)], 1.0223571 secs] [Times: user=1.00 sys=0.02, real=1.01 secs] 
20431.860: [Full GC20431.860: [CMS: 248016K->404539K(8388608K), 1.6120188 secs] 1385622K->404539K(10276096K), [CMS Perm : 54537K->54536K(87672K)], 1.6122663 secs] [Times: user=1.58 sys=0.03, real=1.61 secs] 
20655.867: [Full GC20655.867: [CMS: 404539K->323692K(8388608K), 1.4389098 secs] 958490K->323692K(10276096K), [CMS Perm : 54655K->54652K(90896K)], 1.4391006 secs] [Times: user=1.43 sys=0.00, real=1.43 secs] 
20956.441: [Full GC20956.441: [CMS: 323692K->378663K(8388608K), 1.4717112 secs] 1233969K->378663K(10276096K), [CMS Perm : 54713K->54713K(91092K)], 1.4719180 secs] [Times: user=1.47 sys=0.00, real=1.47 secs] 
21149.053: [Full GC21149.053: [CMS: 378663K->284427K(8388608K), 1.1860728 secs] 717555K->284427K(10276096K), [CMS Perm : 54828K->54828K(91284K)], 1.1862890 secs] [Times: user=1.19 sys=0.00, real=1.19 secs] 
21438.332: [Full GC21438.332: [CMS: 284427K->519857K(8388608K), 2.0659280 secs] 1676708K->519857K(10276096K), [CMS Perm : 62993K->62984K(91476K)], 2.0662088 secs] [Times: user=2.03 sys=0.03, real=2.06 secs] 
21738.902: [Full GC21738.903: [CMS: 519857K->470206K(8388608K), 1.9628709 secs] 1590878K->470206K(10276096K), [CMS Perm : 63230K->63229K(104976K)], 1.9631042 secs] [Times: user=1.95 sys=0.00, real=1.95 secs] 
21922.221: [Full GC21922.221: [CMS: 470206K->448817K(8388608K), 1.7216487 secs] 1127397K->448817K(10276096K), [CMS Perm : 63269K->63269K(105384K)], 1.7218965 secs] [Times: user=1.72 sys=0.00, real=1.72 secs] 
22161.837: [Full GC22161.837: [CMS: 448817K->435548K(8388608K), 1.7168703 secs] 943145K->435548K(10276096K), [CMS Perm : 63298K->63298K(105576K)], 1.7170921 secs] [Times: user=1.72 sys=0.00, real=1.72 secs] 
22402.100: [Full GC22402.100: [CMS: 435548K->613005K(8388608K), 2.3503697 secs] 1841133K->613005K(10276096K), [CMS Perm : 65977K->65976K(105576K)], 2.3506446 secs] [Times: user=2.29 sys=0.05, real=2.34 secs] 
22642.893: [Full GC22642.893: [CMS: 613005K->621214K(8388608K), 2.3825301 secs] 1710749K->621214K(10276096K), [CMS Perm : 66925K->66924K(109964K)], 2.3827835 secs] [Times: user=2.37 sys=0.00, real=2.37 secs] 
22883.148: [Full GC22883.148: [CMS: 621214K->600821K(8388608K), 2.4086658 secs] 1523224K->600821K(10276096K), [CMS Perm : 67550K->67550K(111544K)], 2.4089147 secs] [Times: user=2.40 sys=0.00, real=2.40 secs]

At the peak of gc duration:

51603.282: [Full GC51603.282: [CMS: 4940385K->4859068K(8388608K), 16.6679541 secs] 5987385K->4859068K(10276096K), [CMS Perm : 114750K->114750K(153600K)], 16.6684484 secs] [Times: user=16.66 sys=0.00, real=16.66 secs] 
51813.482: [Full GC51813.482: [CMS: 4859068K->4900085K(8388608K), 16.4928686 secs] 6116576K->4900085K(10276096K), [CMS Perm : 114750K->114750K(153600K)], 16.4933047 secs] [Times: user=16.49 sys=0.00, real=16.49 secs] 
52026.148: [Full GC52026.148: [CMS: 4900085K->4896483K(8388608K), 16.8907197 secs] 6090675K->4896483K(10276096K), [CMS Perm : 114755K->114755K(153600K)], 16.8911947 secs] [Times: user=16.88 sys=0.00, real=16.89 secs] 
52223.160: [Full GC52223.160: [CMS: 4896483K->5031826K(8388608K), 17.1507980 secs] 6559062K->5031826K(10276096K), [CMS Perm : 114763K->114763K(153600K)], 17.1513112 secs] [Times: user=17.16 sys=0.00, real=17.16 secs] 
52425.802: [Full GC52425.802: [CMS: 5031826K->5016422K(8388608K), 17.8492606 secs] 6444936K->5016422K(10276096K), [CMS Perm : 114811K->114811K(153600K)], 17.8497294 secs] [Times: user=17.85 sys=0.00, real=17.85 secs] 
52628.857: [Full GC52628.858: [CMS: 5016422K->5077701K(8388608K), 17.9003775 secs] 6606932K->5077701K(10276096K), [CMS Perm : 114812K->114812K(153600K)], 17.9027559 secs] [Times: user=17.88 sys=0.02, real=17.89 secs] 
52747.442: [GC52747.443: [ParNew: 1677824K->209664K(1887488K), 1.3771078 secs] 6755525K->5530527K(10350324K), 1.3774630 secs] [Times: user=1.64 sys=0.08, real=1.37 secs] 
52833.644: [Full GC52833.644: [CMS: 5320863K->4956574K(8462836K), 17.7157636 secs] 6575478K->4956574K(10350324K), [CMS Perm : 114826K->114826K(153600K)], 17.7162266 secs] [Times: user=17.74 sys=0.00, real=17.72 secs] 
53058.481: [Full GC53058.482: [CMS: 4956574K->5002967K(8462836K), 17.9367047 secs] 6626713K->5002967K(10350324K), [CMS Perm : 114835K->114835K(153600K)], 17.9371138 secs] [Times: user=17.92 sys=0.00, real=17.92 secs]

Full Log: http://pastebin.ca/3409912

With no details at all about the application architecture I don't think anybody can even begin to guess. — Jim Garrison, Mar 24 '16 at 07:40
added more details; what bothers me is that there is always a full GC happening (even shortly after starting the software) — Gishan, Mar 24 '16 at 07:50
Show top or vmstat info. You need to know that it takes the computer resources. It may be iowait for example — hacker13ua, Mar 24 '16 at 08:04
Sorry hacker13ua, but I dont understand. What does top or vmstat info mean? This is a Java console application that runs under windows 24 hours and is restarted at midnight. — Gishan, Mar 24 '16 at 08:12
it seems that your heap size is too big for its original needs. 8~10g's xmx , and 2.5g~4g's xmn may fits better. — Persia, Mar 24 '16 at 08:29
@Persia I've now changed Xmn to 4g and lowered Xmx to 10g. We'll see tomorrow if that helps. Thanks for your input! — Gishan, Mar 24 '16 at 09:24
As best practice, set Xmx and Xms should be same because it will prevent JVM from re sizing the heap. Try set both of this to 8g — Sumanta Pakira, Mar 24 '16 at 15:10
Preventing heap resizing is hardly best practice. Dynamic sizing allows the JVM to meet some of its goals more easily. — the8472, Mar 24 '16 at 17:09
@Gishan when use cms, it's better to turn off UseAdaptiveSizePolicy, adaptiveSizePolicy is for parallel gc. what's the value of CMSInitiatingOccupancyFraction and CMSMaxAbortablePrecleanTime — Persia, Mar 25 '16 at 09:45
@Gishan what's the value of UseCMSInitiatingOccupancyOnly and DisableExplicitGC ? — Persia, Mar 25 '16 at 09:52
@Persia Both values are not set explicitly the jvm manages them itself — Gishan, Mar 25 '16 at 10:13
@Gishan can you add -XX:+PrintTenuringDistribution and -XX:+PrintHeapAtGC and -XX:+PrintFlagsFinal to your jvm args. then check the value of XX:CMSInitiatingPermOccupancyFraction — Persia, Mar 27 '16 at 13:48

Konstantin Pavlov · Answer 1 · 2016-03-24T08:28:55.050

0

A memory leak could cause that pauses. When the system has warmed-up you may take a snapshot of a heap (heap dump) and analyze it with tool like Eclipse Memory Analyzer. You may attach to running process with jvisualvm and try to sample/analyze memory usage. Most efficient fix is to optimize the algorithm.

edited Mar 24 '16 at 08:28

answered Mar 24 '16 at 08:23

Konstantin Pavlov

789
8
23

Thanks, I set up everything for visualVM. I'll geht some heap dumps tomorrow and try to figure out if there is a memory leak. What I still don't understand is why almost all GCs are FullGCs... even shortly after jvm start. – Gishan Mar 24 '16 at 09:28
The heap is full and collector can't free enough space. Since there it's still not enough space in the heap, FullGC is triggered. Again and again... – Konstantin Pavlov Mar 24 '16 at 13:56
Yeah, I understand the concept of minor/major GC. But the heap isn't even near full - the 16g are never used. Especially not directly after starting the software... – Gishan Mar 24 '16 at 15:39
Now that I have VisualVM running: How do I get the difference between to heap dumps in regard to the new instances. I can only see the instance count / size that differs. But that doesn't help much - i.e. a difference in 5000 HashMap entries is interesting, but I need to know what they are (can't search through all of them manually) – Gishan Mar 25 '16 at 08:02
@Gishan, take a heapdump and try to analyze it with Memory Analyzer – Konstantin Pavlov Mar 25 '16 at 08:45
Memory Analyzer seems nice (better performance then VisualVM) but I cant find a simple option to get the object instances that are different between two dumps – Gishan Mar 25 '16 at 10:15
Try to find the biggest HashMap and references to it. It may give you an idea what where in the code the HashMap is used. Then analyze the code why the map grows. – Konstantin Pavlov Mar 25 '16 at 10:54
I dont see a way to analyze it manually. There are several thousand HashMap entries and some hundred HashMaps. I have to get the difference between the dumps by a tool. – Gishan Mar 25 '16 at 11:31

score 0 · Answer 2 · answered Mar 24 '16 at 14:42

0

What could cause these problems?

Add -XX:+PrintGCCause to see what triggers the full GC.

answered Mar 24 '16 at 14:42

the8472

35,110
4
54
107

Thanks I've added this also. I'll update tomorrow with the resulting logs. – Gishan Mar 24 '16 at 15:49

score 0 · Answer 3 · answered Mar 24 '16 at 17:42

When CMS fails, it falls back to a single threaded collector (you can see this by the real == user time) This you want to avoid.

I suspect your Survivor spaces are not large enough so that when the Eden space is cleaned up there isn't enough space in the survivors and this triggers a full GC, single threaded.

CMS Garbage Collection taking too long

3 Answers3