2

We use heavy multithreading in a Swing application for extensive calculations. From time to time the application runs into an OutOfMemoryError (OOME) and cannot create any more native threads. I fully understand that the application should guard against this and that hitting the limit points to a design flaw, but it cannot be avoided 100%. The problem is that in such a case the JVM is lost: it cannot handle the error, and the system behaves unpredictably. Usually we log every memory error and restart the application via -XX:OnOutOfMemoryError="kill -9 %p", but this does not work here for obvious reasons. It is also a bit frustrating that the JVM no longer has any control. So what would be a good way to deal with this kind of problem?

PS: I am not looking for a solution like raising the system's process limits or reducing the thread stack size via -Xss. I am looking for a general approach to handling this situation.

Thomas
  • 1,325
  • 13
  • 22
  • 1
    Hint: be careful about getting too comfortable with workarounds. When your design / code is broken and contains a bug, then each minute spent on something that doesn't contribute to *fixing* that bug is a risky investment. – GhostCat Jul 17 '17 at 11:16
  • @GhostCat Thanks for the hint, however sometimes the environment is the problem. For instance, we had a problem with thread limits on newer Linux systems being too low (the systemd default). So we had beautiful code that simply did not run in another environment. – Thomas Jul 18 '17 at 12:41

1 Answer

2

The JVM has perfect control over OutOfMemoryErrors and handles them gracefully; what does not handle them gracefully is your program. You can catch and handle an OutOfMemoryError in the same way as any other error; most programs simply never do.
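As a minimal sketch of what catching the error can look like for thread creation: on HotSpot, a failed native thread allocation typically surfaces as an `OutOfMemoryError` thrown from `Thread.start()`. The class `ThreadStartGuard` and the method `tryStart` are hypothetical names chosen for this illustration, not part of any library, and whether the process is still healthy enough to log at that point is an assumption that depends on the environment.

```java
public class ThreadStartGuard {

    /**
     * Tries to start the given thread. Returns false instead of propagating
     * an OutOfMemoryError, so the caller can back off, queue the work, or
     * run it on an existing thread.
     */
    public static boolean tryStart(Thread t) {
        try {
            t.start();
            return true;
        } catch (OutOfMemoryError e) {
            // On HotSpot the message is typically
            // "unable to create new native thread".
            System.err.println("Could not start " + t.getName() + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> System.out.println("working"), "worker-1");
        if (tryStart(worker)) {
            worker.join();
        } else {
            System.err.println("Degraded mode: no worker thread available");
        }
    }
}
```

The catch block is kept deliberately small, since any work done there runs while the process is already short of resources.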

To solve your problem, you should first try to pinpoint the root cause of those memory errors, for example by logging them or by using performance/memory analysis tools. Forcing a core dump in these cases can also be useful, since it lets you analyze the state at the moment the error occurred.
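One way to make such logs more useful is to record thread and heap figures alongside the error. The sketch below uses the standard `ThreadMXBean` and `Runtime` APIs; the class name `ThreadStats` and the output format are assumptions made for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadStats {

    /** Builds a one-line snapshot of thread and heap usage for a log entry. */
    public static String snapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Runtime rt = Runtime.getRuntime();
        long usedHeap = rt.totalMemory() - rt.freeMemory();
        return "liveThreads=" + threads.getThreadCount()
                + " peakThreads=" + threads.getPeakThreadCount()
                + " usedHeapBytes=" + usedHeap
                + " maxHeapBytes=" + rt.maxMemory();
    }

    public static void main(String[] args) {
        // Log this periodically, or when an OutOfMemoryError is caught.
        System.out.println(snapshot());
    }
}
```

A peak thread count close to the operating system's limit points at thread exhaustion, while a used heap close to the maximum points at heap exhaustion.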

In the end, redesigning the application will be necessary to avoid OOM errors by limiting the amount of memory used. This can be done either by testing how many threads the program can handle gracefully and then enforcing that limit, or by checking free memory before creating a new thread. Architectural changes might also help, but you posted no details about the internals, so I can't give any advice here.
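Enforcing a thread limit can be sketched with a bounded `ThreadPoolExecutor` from `java.util.concurrent`. The class `BoundedWorkers`, its method names, and the numbers used in `main` are assumptions for illustration only; a real limit would come from testing the application on its target systems.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedWorkers {

    /** Creates a pool that never holds more than maxThreads worker threads. */
    public static ThreadPoolExecutor create(int maxThreads, int queueCapacity) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 30L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                // When the queue is full, the submitting thread runs the task
                // itself, which slows producers down instead of adding threads.
                new ThreadPoolExecutor.CallerRunsPolicy());
        pool.allowCoreThreadTimeOut(true); // let idle workers exit
        return pool;
    }

    /** Runs taskCount trivial tasks and returns the largest pool size observed. */
    public static int runTasks(int taskCount, int maxThreads, int queueCapacity) {
        ThreadPoolExecutor pool = create(maxThreads, queueCapacity);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        if (done.get() != taskCount) {
            throw new IllegalStateException("only " + done.get() + " tasks ran");
        }
        return pool.getLargestPoolSize();
    }

    public static void main(String[] args) {
        System.out.println("largest pool size: " + runTasks(100, 4, 16));
    }
}
```

In a Swing application, `CallerRunsPolicy` would run overflow tasks on the event dispatch thread if tasks are submitted from there, so a different rejection policy or a dedicated submitting thread may be preferable in that case.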

TwoThe
  • 12,597
  • 3
  • 26
  • 50
  • They can be handled like any other Error, with the one caveat that you need to catch them at a point where the GC is able to free enough memory for your code to run, i.e. somewhere outside the scope where the variables holding lots of memory are declared. – TwoThe Jul 17 '17 at 11:50
  • In real life architecture would ensure that there is no OOM error (like using weak object links on really big objects, or process on disk/sequentially). Calling gc is not required, the JVM will automatically try to free memory to function and throw OOM if that fails. – TwoThe Jul 17 '17 at 11:54
  • You recommend catching this exception in real life. So what do you suggest to put into that catch() clause then? – GhostCat Jul 17 '17 at 11:55
  • Catch and log as usual. The only difference is that you must catch outside of scope as above so memory can be made available. In general I have something like a `catch (Error e) { log.error("General failure: ", e); }` clause at the outer-most scope. This is so I know that something went horribly wrong, yet the server in general (but not the individual task) can continue. – TwoThe Jul 17 '17 at 11:58
  • I see; but still: from the point of view of an individual user interacting with that server, what happens there? And do you do things like "counting" such incidents, so that you maybe still shut down when you get 5 such errors in 1 minute? – GhostCat Jul 17 '17 at 12:08
  • The session with the individual user would be terminated, but there is no easy way to continue that session anyway. This is called an Error for a reason: it is an error in the code, not a state that should be handled. How that error is then handled on the operations side is another question and depends entirely on the application and how critical it is that it is up and running. Technically the application can continue to run (mostly fine), but restarting the server sounds like a good idea. In any case, such errors should always be followed by an investigation. – TwoThe Jul 17 '17 at 12:14
  • As I stated in my question, the OOM caused by threading is a special issue. When the limit is reached you cannot handle anything, nor can you dump the heap. I am sorry to say it so bluntly, but Java does not have perfect control then. – Thomas Jul 17 '17 at 14:32
  • I have managed several Java projects that had memory issues while being heavily threaded and never encountered any issue with catching OOM. I assume that there is more to that bug than just threading. – TwoThe Jul 17 '17 at 14:52