
I have a strange issue with one of my production machines. It hosts a Java application that runs CMS (concurrent mark and sweep), but CMS cleans up only a small part of the old generation. I suspected a memory leak and tried a heap dump, but the Full GC preceding the heap dump cleans up almost all of the old generation. What's happening? I've never seen this behavior from the Java garbage collector before. Normally CMS and Full GC should collect about the same amount of garbage; here CMS leaves behind about 10 GB more.

  • Java 1.7.0_75
  • Linux CentOS 7

GC logs:

**2016-01-04T07:37:40.196+0000: 431200.698: [GC [1 CMS-initial-mark: 21633423K(27336704K)] 22826703K(30101504K), 4.3910840 secs] [Times: user=4.40 sys=0.01, real=4.39 secs] 
2016-01-04T07:37:44.588+0000: 431205.090: [CMS-concurrent-mark-start]
2016-01-04T07:38:08.718+0000: 431229.220: [CMS-concurrent-mark: 18.213/24.131 secs] [Times: user=126.00 sys=2.22, real=24.13 secs] 
2016-01-04T07:38:08.718+0000: 431229.220: [CMS-concurrent-preclean-start]
2016-01-04T07:38:08.843+0000: 431229.345: [CMS-concurrent-preclean: 0.118/0.125 secs] [Times: user=0.29 sys=0.00, real=0.12 secs] 
2016-01-04T07:38:08.843+0000: 431229.345: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2016-01-04T07:38:13.906+0000: 431234.408: [CMS-concurrent-abortable-preclean: 4.975/5.063 secs] [Times: user=10.18 sys=0.03, real=5.06 secs] 
2016-01-04T07:38:14.001+0000: 431234.503: [GC[YG occupancy: 1312993 K (2764800 K)]2016-01-04T07:38:14.001+0000: 431234.503: [Rescan (parallel) , 4.6981760 secs]2016-01-04T07:38:18.699+0000: 431239.202: [weak refs processing, 0.0002970 secs]2016-01-04T07:38:18.700+0000: 431239.202: [scrub string table, 0.0006900 secs] [1 CMS-remark: 21633423K(27336704K)] 22946417K(30101504K), 4.6993310 secs] [Times: user=105.40 sys=0.36, real=4.70 secs] 
2016-01-04T07:38:18.701+0000: 431239.203: [CMS-concurrent-sweep-start]
2016-01-04T07:38:27.967+0000: 431248.469: [CMS-concurrent-sweep: 9.160/9.267 secs] [Times: user=17.91 sys=0.10, real=9.26 secs] 
2016-01-04T07:38:27.968+0000: 431248.470: [CMS-concurrent-reset-start]
2016-01-04T07:38:28.028+0000: 431248.531: [CMS-concurrent-reset: 0.061/0.061 secs] [Times: user=0.14 sys=0.00, real=0.06 secs]** 
2016-01-04T07:38:30.801+0000: 431251.303: [GC [1 CMS-initial-mark: 21633105K(27336704K)] 23039228K(30101504K), 5.6079370 secs] [Times: user=5.60 sys=0.01, real=5.61 secs] 
2016-01-04T07:38:36.409+0000: 431256.911: [CMS-concurrent-mark-start]
2016-01-04T07:38:54.673+0000: 431275.175: [CMS-concurrent-mark: 17.807/18.264 secs] [Times: user=119.97 sys=1.66, real=18.26 secs] 
2016-01-04T07:38:54.673+0000: 431275.175: [CMS-concurrent-preclean-start]
2016-01-04T07:38:54.799+0000: 431275.301: [CMS-concurrent-preclean: 0.119/0.126 secs] [Times: user=0.25 sys=0.00, real=0.13 secs] 
2016-01-04T07:38:54.799+0000: 431275.301: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2016-01-04T07:39:01.382+0000: 431281.884: [CMS-concurrent-abortable-preclean: 6.478/6.583 secs] [Times: user=12.23 sys=0.06, real=6.58 secs] 
2016-01-04T07:39:01.487+0000: 431281.989: [GC[YG occupancy: 1596183 K (2764800 K)]2016-01-04T07:39:01.487+0000: 431281.989: [Rescan (parallel) , 3.5737630 secs]2016-01-04T07:39:05.061+0000: 431285.563: [weak refs processing, 0.0002690 secs]2016-01-04T07:39:05.061+0000: 431285.563: [scrub string table, 0.0005740 secs] [1 CMS-remark: 21633105K(27336704K)] 23229288K(30101504K), 3.5747910 secs] [Times: user=80.26 sys=0.27, real=3.58 secs] 
2016-01-04T07:39:05.062+0000: 431285.564: [CMS-concurrent-sweep-start]
2016-01-04T07:39:21.895+0000: 431302.397: [CMS-concurrent-sweep: 9.449/16.834 secs] [Times: user=27.52 sys=0.16, real=16.83 secs] 
2016-01-04T07:39:21.895+0000: 431302.397: [CMS-concurrent-reset-start]
2016-01-04T07:39:21.995+0000: 431302.497: [CMS-concurrent-reset: 0.099/0.099 secs] [Times: user=0.23 sys=0.01, real=0.10 secs] 
2016-01-04T07:39:24.104+0000: 431304.606: [GC [1 CMS-initial-mark: **21631742K(27336704K)**] 23455261K(30101504K), 5.6592940 secs] [Times: user=5.67 sys=0.00, real=5.66 secs] 
2016-01-04T07:39:29.764+0000: 431310.266: [CMS-concurrent-mark-start]
2016-01-04T07:39:56.859+0000: 431337.361: [CMS-concurrent-mark: 19.813/27.096 secs] [Times: user=140.17 sys=2.62, real=27.10 secs] 
2016-01-04T07:39:56.859+0000: 431337.361: [CMS-concurrent-preclean-start]
2016-01-04T07:39:57.102+0000: 431337.604: [CMS-concurrent-preclean: 0.152/0.242 secs] [Times: user=0.35 sys=0.00, real=0.24 secs] 
2016-01-04T07:39:57.102+0000: 431337.604: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2016-01-04T07:40:03.016+0000: 431343.518: [CMS-concurrent-abortable-preclean: 5.376/5.914 secs] [Times: user=11.65 sys=0.05, real=5.91 secs] 
2016-01-04T07:40:03.016+0000: 431343.518: [GC[YG occupancy: 2045175 K (2764800 K)]2016-01-04T07:40:03.016+0000: 431343.518: [Rescan (parallel) , 3.4843400 secs]2016-01-04T07:40:06.501+0000: 431347.003: [weak refs processing, 0.0002510 secs]2016-01-04T07:40:06.501+0000: 431347.003: [scrub string table, 0.0006220 secs] [1 CMS-remark: **21631742K(27336704K)**] 23676918K(30101504K), 3.4853760 secs] [Times: user=78.31 sys=0.27, real=3.49 secs] 
2016-01-04T07:40:06.502+0000: 431347.004: [CMS-concurrent-sweep-start]
{Heap before GC invocations=1832 (full 5077):
 par new generation   total 2764800K, used 2166647K [0x00000000bae00000, 0x0000000176600000, 0x0000000176600000)
  eden space 2457600K,  77% used [0x00000000bae00000, 0x000000012e6a4c90, 0x0000000150e00000)
  from space 307200K,  89% used [0x0000000150e00000, 0x0000000161939318, 0x0000000163a00000)
  to   space 307200K,   0% used [0x0000000163a00000, 0x0000000163a00000, 0x0000000176600000)
 concurrent mark-sweep generation total 27336704K, used 21631644K [0x0000000176600000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 49956K, used 30116K [0x00000007fae00000, 0x00000007fdec9000, 0x0000000800000000)
**2016-01-04T07:40:12.775+0000: 431353.277: [Full GC2016-01-04T07:40:12.775+0000: 431353.277: [CMS2016-01-04T07:40:17.924+0000: 431358.426: [CMS-concurrent-sweep: 9.211/11.422 secs] [Times: user=16.59 sys=0.15, real=11.42 secs] 
 (concurrent mode interrupted): 21631644K->4907878K(27336704K), 39.2467600 secs] 23798292K->4907878K(30101504K), [CMS Perm : 30116K->28023K(49956K)], 39.2468730 secs] [Times: user=39.24 sys=0.05, real=39.25 secs]** 

The same application is running normally on another machine with CentOS 5 and Java 7.

java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=1581
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false -verbose:gc -XX:+PrintGCDetails
-XX:+UseCompressedOops -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC
-Xloggc:/usr/local/app/logs/current-gc.log -Xms29g -Xmx29g -XX:NewSize=3000m
-XX:SurvivorRatio=8 -XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
-Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFFE
-Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFFE
-cp /usr/local/app/conf:/usr/local/app/app.jar:/usr/local/app/lib/* -Xdebug
-Xrunjdwp:transport=dt_socket,address=8099,server=y,suspend=n
-Dvisualvm.display.name=App -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/usr/local/app/logs/
-XX:ErrorFile=/usr/local/app/logs/hs_err_pid%p.log
-Djgroups.tcpgossip.initial_host=localhost
-Dlog4j.hostname=host7.company.com com.company.app.service.App

UPDATE: The problem is still not solved. I've tried everything: updating the OS packages and kernel, updating Java to 1.7.0_80, rolling back the application version, but without success.

I've also gone back through older GC logs and found that this issue has not always been present. It started about a month ago, after a deployment.

Aditya W
dcernahoschi
  • @ravindra The question is not related to G1. – dcernahoschi Dec 23 '15 at 09:57
  • Have a look at this one: http://stackoverflow.com/questions/3873635/java-concurrentmarksweep-garbage-collector-not-removing-all-garbage – Ravindra babu Dec 23 '15 at 10:02
  • I thought that you were looking for an explanation of why this is happening. Have you tried to ssh in and profile it directly on the machine? Regarding the updated question, I agree with @ravindra and his link that it could be a bug with jmx. – Yuri Dec 23 '15 at 10:10
  • 1
    I have tried many combinations in CMS and finally moved to G1gc. For 29gb memory, I can't see any other algorithm better than G1GC. – Ravindra babu Dec 23 '15 at 10:10
  • @ravindra Interesting, but there was no remote jconsole connected to that jvm and there are no "Unloading ..." messages in the gc logs. – dcernahoschi Dec 23 '15 at 10:11
  • @ravindra yes. But it is still very young algorithm and in JDK8 they were not sure that it should be default one. But anyway in practice for large heaps especially it seems to be the best one. – Yuri Dec 23 '15 at 10:12
  • I have found one more interesting one :http://googleweblight.com/?lite_url=http://stackoverflow.com/questions/5322761/java-cms-gc-behaviours&ei=0pZoc1YD&lc=en-IN&geid=9&s=1&m=506&ts=1450865842&sig=ALL1Aj5JXBHE1xCQ6HFmAEd4YJkNk97Ocg – Ravindra babu Dec 23 '15 at 10:19
  • Full GC starts when the tenured space is full, or the survivor space is exhausted (e.g. there are too many objects copied from the eden space), or the CMS decides now is a good time to try and do a concurrent cleanup. – Ravindra babu Dec 23 '15 at 10:20
  • Take a look also at [Minor GC vs Major GC vs Full GC post](https://plumbr.eu/blog/garbage-collection/minor-gc-vs-major-gc-vs-full-gc) – Yuri Dec 23 '15 at 10:26
  • Is CMS class unloading enabled? I don't recall whether that was available/enabled by default in 7. – the8472 Dec 23 '15 at 12:29
  • 1.7.0_80 is not the latest java version – the8472 Jan 27 '16 at 11:20
  • Is there any way you could upload more logs? I'm not advocating for pasting it into the question itself, but this Q/A on meta (http://meta.stackexchange.com/questions/15821/stack-overflow-etiquette-for-sharing-files-and-resources) gives some decent suggestions for what I'm talking about. – entpnerd Jan 29 '16 at 13:47
  • @dcernahoschi Apart from the differing versions of the OS, are the environments on both hosts (and the hosts themselves) equivalent? – entpnerd Jan 30 '16 at 02:34
  • The hardware is also different. They are 2 physical, not virtual machines. The one with the issue is newer. But no other differences. – dcernahoschi Jan 30 '16 at 09:28

4 Answers

3

I suspect that the need for a major collection has not yet come up in your application due to the very large heap: neither the tenured old gen nor the survivor space has been exhausted.

If you think otherwise (a major collection has not been triggered even though one of these conditions has been met), share your GC log statements.

On a different note: since you are using a large heap, G1GC would be preferred.

EDIT:

-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly 

By default, CMS GC uses a set of heuristic rules to trigger garbage collection. This makes GC less predictable and usually tends to delay collection until the old generation is almost fully occupied.

-XX:+UseCMSInitiatingOccupancyOnly prevents the use of these GC heuristics.

-XX:CMSInitiatingOccupancyFraction tells the JVM at what old-generation occupancy CMS should be triggered.

You have configured it as 70%. If your old gen [max heap (29 G) minus new gen (3 G)] reaches 70% of its capacity, a CMS cycle will be triggered. As part of GC, a minor GC is triggered first. If the minor GC does not release enough space to allocate new objects at that juncture, then a major GC or Full GC is triggered.
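
As a rough, back-of-the-envelope illustration (not part of the application; the capacity and usage values are copied from the GC logs in the question):

    // Arithmetic behind the 70% trigger, using the old-gen numbers from the question's logs.
    public class CmsThresholdCheck {
        public static void main(String[] args) {
            long oldGenCapacityK = 27336704L;  // "concurrent mark-sweep generation total 27336704K"
            long oldGenUsedK     = 21631644L;  // "used 21631644K"
            double fraction      = 0.70;       // -XX:CMSInitiatingOccupancyFraction=70

            long triggerK = (long) (oldGenCapacityK * fraction);
            double occupancyPct = 100.0 * oldGenUsedK / oldGenCapacityK;

            System.out.printf("CMS trigger threshold: %d K (~%.1f GB)%n", triggerK, triggerK / (1024.0 * 1024));
            System.out.printf("Current old-gen occupancy: %.1f%%%n", occupancyPct);
            // Roughly 79% used against a 70% trigger: CMS keeps starting new cycles back-to-back,
            // which matches the consecutive CMS-initial-mark entries in the log.
        }
    }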

Have a look at this article for more details.

I have found one more related SE question : CMS garbage collector - when does it run?. Have a look at this one too.

Ravindra babu
  • From your logs, concurrent mark-sweep generation total 27336704K, used 21631644K implies 21G out of 27G has been used. Still 6G of memory is available. So FullGC is not required. – Ravindra babu Jan 04 '16 at 10:47
  • The idea is that the automatically triggered CMS is not cleaning up memory, but the manually invoked Full GC does. Why? – dcernahoschi Jan 04 '16 at 10:52
  • I am assuming that FullGC log is generated when you manually invoked FullGC. Without manual intervention, CMS did not trigger FullGC at all. Correct me if I am wrong. – Ravindra babu Jan 04 '16 at 11:03
  • CMS did not trigger full gc since free memory of 6GB is already available. CMS algorithm has been optimised to reduce FullGC calls. – Ravindra babu Jan 04 '16 at 11:21
3

CMS does not unload classes by default in JDK 7.
You have to specify -XX:+CMSClassUnloadingEnabled, or better switch to JDK 8.

BTW, if you run jmap without the live argument, it won't call Full GC before creating a heap dump.
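
For example (a sketch only: the PID and file paths are placeholders; only the live form forces a Full GC before dumping):

    jmap -dump:live,format=b,file=/tmp/heap-live.hprof <pid>   # reachable objects only, forces a Full GC first
    jmap -dump:format=b,file=/tmp/heap-all.hprof <pid>         # everything, no Full GC forced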

apangin
  • Indeed you can take the dump without full gc and it seems it contains unreachable objects. Here is the log from Eclipse MAT `[INFO] Removed 308,720,881 unreachable objects using 12,603,944,736 bytes`. – dcernahoschi Feb 01 '16 at 21:02
3

There are a few possible causes for your woes. I'll state the likely quick fix first, and then some best practices below that you should consider in the long term.

  1. Likely quick fix: The "CMS: abort preclean due to time" entries in your logs suggest that your CMS collector isn't tuned properly to do its job before the ultimate stop-the-world full GC phase. Basically, this log message means that the preclean phase is timing out and further steps of the CMS collector aren't able to run. The default time before the preclean phase is aborted is 5 seconds, which appears to be the case from your logs. Consider increasing this to something like 10 seconds and observe what happens. You should be able to do this by adding the parameter -XX:CMSMaxAbortablePrecleanTime=10000 (see the sketch below). See Misamitsu's blog entry on the CMSMaxAbortablePrecleanTime tuning parameter. Also, based on what I can see from Stas' blog entry on the subject, the unit is milliseconds, not seconds.
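
A minimal sketch, assuming the flag is simply appended to your existing startup command (10000 ms = 10 s; the rest of the command is unchanged):

    java ... -XX:+UseConcMarkSweepGC \
         -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
         -XX:CMSMaxAbortablePrecleanTime=10000 \
         ... com.company.app.service.App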

Try these in the longer term:

  1. If you can, try the G1 (G1GC) collector. You are using Java 7, so if your current environment allows it, give it a try. It's designed for performance with big heaps.
  2. The fact that this happened as a result of a deployment and the rollback didn't work suggests to me that:
    • Your entire app configuration (such as the JVM tuning parameters) isn't captured as part of your deployment, or...
    • The cause of the performance issues is unrelated to your app. Perhaps the differences between your two production hosts, or an unrelated app running on one host (such as a different antivirus version), is making life more difficult for your garbage collector. So again, if you can, deploy to exactly the same hardware and software.
  3. Evaluate whether you really need such a large heap. Remember, the bigger the heap, the longer the garbage collector takes, irrespective of whether you are using a concurrent collector such as CMS.
  4. You have a lot of tuning parameters and I would question whether you need the majority of them. Consider running a benchmark experiment where you simply drop the -XX:NewSize=3000m, -XX:SurvivorRatio=8, -XX:CMSInitiatingOccupancyFraction=70, -XX:+UseCMSInitiatingOccupancyOnly, -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFFE, and -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFFE parameters and see if your situation improves (see the sketch after this list).
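
One possible shape for such a benchmark run, combining points 1 and 4 (a sketch only: it keeps the heap size, GC logging, and classpath from the question and swaps CMS for G1; a baseline to measure against, not a set of recommended final values):

    java -Xms29g -Xmx29g -XX:+UseG1GC \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -Xloggc:/usr/local/app/logs/current-gc.log \
         -cp /usr/local/app/conf:/usr/local/app/app.jar:/usr/local/app/lib/* \
         com.company.app.service.App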

I hope that helps, please comment with follow up questions if needed.

entpnerd
  • Regarding the quick fix: the other machine that works normally displays in the logs the same 'CMS: abort preclean due to time'. But yes, it might be an issue under different circumstances. – dcernahoschi Feb 01 '16 at 21:10
  • Regarding 1. CMS proved more predictable and we have no major issues with it until now. – dcernahoschi Feb 01 '16 at 21:12
  • Regarding 2. Most likely the second option is the case because I had no issue on any other machine than this one. – dcernahoschi Feb 01 '16 at 21:13
  • Regarding 3. Under present design, yes, the heap holds a large cache for fast access. And again, the gc times were always good. – dcernahoschi Feb 01 '16 at 21:15
  • Regarding 4. These parameters are used on every deployment, they were tuned at some point and proved good. – dcernahoschi Feb 01 '16 at 21:16
  • This answer shows the best understanding of the problem and seems most helpful. I'll play a little with the `-XX:CMSMaxAbortablePrecleanTime` parameter to see if it makes any difference. – dcernahoschi Feb 01 '16 at 21:19
  • Let me know if you have any further issues and/or questions. I'd be happy to help - irrespective of points. – entpnerd Feb 01 '16 at 21:28
  • You write: "Basically what this log message means is that the preclean phase is timing out and further steps of the CMS collector aren't able to run." Do you have a reference for this claim? Everybody else says that "CMS: abort preclean due to time" is no big deal, preclean is designed to be aborted, and CMS will continue with the next phase. Which one is it then? I'm also seeing this problem, and it *looks* like CMS doesn't finish the cleanup (esp. weak references), old collection doesn't go down at all. – Peter Štibraný Apr 28 '17 at 11:57
2

There is a reason why this is happening. It seems that when you are taking a heap dump, the dumpHeap() method is being called. For example, VisualVM triggers a full GC before making a heap dump, too.

From documentation:

void dumpHeap(String outputFile, boolean live) throws IOException

Dumps the heap to the outputFile file in the same format as the hprof heap dump. If this method is called remotely from another process, the heap dump output is written to a file named outputFile on the machine where the target VM is running. If outputFile is a relative path, it is relative to the working directory where the target VM was started.

Parameters:

outputFile - the system-dependent filename

live - if true dump only live objects i.e. objects that are reachable from others

Why?

When you are trying to find a memory leak, you don't want to get objects that have no references (i.e. objects that have not yet been garbage collected), which is why the dump is taken with live = true and a full GC runs first.
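
Conversely, if you do want the unreachable objects in the dump (and therefore no implicit full GC), you can call dumpHeap with live = false yourself. A minimal sketch (the dump path is arbitrary):

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    public class HeapDumper {
        public static void main(String[] args) throws Exception {
            HotSpotDiagnosticMXBean diagnostics =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // live = false: dump all objects, including unreachable ones,
            // so no full GC is forced before the dump is written.
            diagnostics.dumpHeap("/tmp/heap-with-garbage.hprof", false);
        }
    }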

Recommendation

I'm seeing a pretty large heap; it is probably better to use G1.

The first focus of G1 is to provide a solution for users running applications that require large heaps with limited GC latency. This means heap sizes of around 6GB or larger, and stable and predictable pause time below 0.5 seconds.

Take a look at the recommendation section in the documentation as well.

Yuri