Why can the JVM recover from an OOM (Java heap space) error by itself?

Time:07-05

Integer[][] data = new Integer[1000000][100000];

As in the simple demo code above, I tried to allocate a very large amount of memory to trigger an OOM in a Pandora container (a web container developed by Alibaba, similar to Tomcat). However, the error seems to affect ONLY the current request; the web service does NOT crash. As far as I know, unlike an exception, an error in Java should NOT be recoverable, and it should affect the whole process. I am puzzled. Please advise. Thanks.

CodePudding user response:

The error OutOfMemoryError: Java heap space behaves like any other uncaught exception: it only terminates the current thread. It only means that the Java heap ran out of space. If the thread dies and all objects created in that thread become unreachable, they can be collected; then there is enough heap space to continue, and there is no reason for the whole application to crash.
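To see this in isolation, here is a minimal sketch. It is not taken from Pandora or any other container, and the class and method names are made up for illustration. It assumes a HotSpot JVM, where a request for new long[Integer.MAX_VALUE] cannot be satisfied and throws an OutOfMemoryError (the message may be something other than "Java heap space"). The worker thread dies, and the main thread carries on allocating.

```java
import java.util.concurrent.atomic.AtomicReference;

final class WorkerOomeDemo {

    /**
     * Starts a worker that requests an impossibly large array and returns
     * the Throwable that terminated the worker (or null if it ended normally).
     */
    static Throwable runDoomedWorker() {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            // Assumption: on HotSpot this allocation always fails with an OutOfMemoryError.
            long[] huge = new long[Integer.MAX_VALUE];
            System.out.println(huge.length); // not reached
        }, "doomed-worker");
        // Record whatever kills the worker instead of printing a stack trace.
        worker.setUncaughtExceptionHandler((t, e) -> failure.set(e));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failure.get();
    }

    public static void main(String[] args) {
        Throwable cause = runDoomedWorker();
        System.out.println("worker ended with: " + cause);
        // The main thread is unaffected and can still allocate.
        byte[] small = new byte[1024];
        System.out.println("main thread allocated " + small.length + " bytes");
    }
}
```

Only the worker is terminated; nothing in the JVM forces the other threads to stop.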

The application itself will only be killed by the operating system if it exceeds the memory that the operating system can provide. There are, however, options in some JVMs that explicitly crash the JVM on any out-of-memory error, e.g. -XX:+CrashOnOutOfMemoryError.

CodePudding user response:

First this:

"As far as I know, unlike an exception, an error in Java should NOT be recoverable, and it should affect the whole process."

In general that is true. In the case of a web container, recovery from OOME on a request thread has a better chance of succeeding than for a typical multi-threaded application.

Why?

Because the work done to handle one web request on one worker thread is typically independent of other threads. That means that an OOME in a request is less likely to leave shared data structures in an inconsistent state.

But you still have the problem that the root cause of the OOME could be a memory leak ... and that most likely won't go away when the web container cleans up the request thread and creates a new one. Hence it is still dubious for a web container to recover from OOMEs.

But this is fairly common behavior anyway. I think that the reasoning is that attempting to recover with a reasonable chance of succeeding is better than failing fast.
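The per-request isolation can be sketched roughly like this. This is an illustration only: a plain ExecutorService stands in for the container's worker pool, and handle, serve and the "big" request are hypothetical names, not anything from Pandora or Tomcat. It again assumes a HotSpot JVM, where new long[Integer.MAX_VALUE] throws an OutOfMemoryError.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class RequestIsolationSketch {

    /** Hypothetical request handler: the "big" request asks for an array no heap can hold. */
    static String handle(String request) {
        if (request.equals("big")) {
            // Assumption: on HotSpot this allocation always fails with an OutOfMemoryError.
            long[] huge = new long[Integer.MAX_VALUE];
            return "allocated " + huge.length; // not reached
        }
        return "ok:" + request;
    }

    /** Runs one request on a pool thread and reports its outcome as a status string. */
    static String serve(ExecutorService pool, String request) {
        Future<String> result = pool.submit(() -> handle(request));
        try {
            return result.get();
        } catch (ExecutionException e) {
            // The Error thrown inside the task arrives here as the cause;
            // the pool thread itself stays alive for the next request.
            return "failed:" + e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            for (String request : new String[] {"a", "big", "b"}) {
                System.out.println(request + " -> " + serve(pool, request));
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch the "big" request is reported as failed with OutOfMemoryError while "a" and "b" complete normally, because the requests share no state. A leak in shared state would not be fixed by this kind of isolation.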


So why is it possible to recover at all?

Consider this snippet:

   public void test() {
       try {
           Integer[][] data = new Integer[1000000][100000];
       } catch (OutOfMemoryError ex) {
           // log it
       }
       // do something else
   }

Observations:

  1. An OOME is only thrown after the GC has run. The typical sequence of JVM actions leading up to an OOME is something like this:

    • Attempt to allocate a large object.
    • Find there is not enough free space.
    • Run a young-generation (new space) GC.
    • Try the allocation again.
    • Still not enough free space.
    • Run a full GC.
    • Try the allocation again.
    • Still not enough free space.
    • Throw the OOME.
  2. The new Integer[1000000][100000] is an all-or-nothing thing. If it triggers an OOME, then the objects that it has allocated so far will all be unreachable. So if the // do something else code tries to allocate another object, and the heap is still full, then the JVM will run the GC again ... which will reclaim those unreachable objects ... and we are back in business.

  3. Even if that were not the case ... once data goes out of scope, the tree of Integer[][] and Integer[] objects that it refers to may now be detectable as unreachable by the GC.

  4. If the OOME is thrown on a child thread, and the thread is allowed to die, then all of the thread's local variables will no longer be reachable. That results in more (potentially) unreachable objects.

The point is that by the time you catch and recover from the OOME, there is likely to be some collectable garbage.
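Here is a runnable variant of the snippet above, offered as a sketch rather than a benchmark. To avoid actually filling the heap, it replaces the two-dimensional array with a single long[] sized just beyond Runtime.getRuntime().maxMemory(), so the request can never fit. The class and helper names are made up for illustration.

```java
final class CatchAndContinueDemo {

    /** Length of a long[] that needs more bytes than the heap can ever hold. */
    static int oversizedLength() {
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        // Cap at Integer.MAX_VALUE, which HotSpot also rejects as an array length.
        return (int) Math.min(Integer.MAX_VALUE, maxHeapBytes / Long.BYTES + 1);
    }

    /** Returns true if the OOME was caught and a later allocation succeeded. */
    static boolean recoverAfterOome() {
        boolean caught = false;
        try {
            long[] data = new long[oversizedLength()];
            data[0] = 1; // not reached
        } catch (OutOfMemoryError ex) {
            caught = true; // a real application would log it here
        }
        // "do something else": an ordinary allocation after the failure
        byte[] next = new byte[1 << 20];
        return caught && next.length == (1 << 20);
    }

    public static void main(String[] args) {
        System.out.println("recovered: " + recoverAfterOome());
    }
}
```

Because the failed request never becomes a live object, the heap is no fuller after the catch block than it was before the allocation attempt.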


So if it is possible to recover, why do people advise against recovering from OOMEs at all?

  1. Because the OOME's cause may be a memory leak. Recovering from an OOME that is caused by a memory leak can result in poor performance. The heap will eventually fill up to a point where GC takes far too much time.

  2. Because an OOME can lead to a data structure being left in an inconsistent state; e.g. your code was updating it when it got the OOME.

  3. Because an OOME can break concurrent behaviors. For example, suppose thread A is waiting for a notify from thread B. If B gets an OOME, it may die completely, or it may attempt to recover ... at a point where its lock has been released. Either way, there is a risk that thread A will be stuck forever waiting for a notify that will never happen. (Thread B should probably trigger an application shutdown to avoid this.)
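The third hazard can be sketched as follows. This is an illustration with made-up names, and it swaps wait/notify for a CountDownLatch with a timeout so that the demo terminates instead of hanging. With an untimed wait, thread A would simply block forever. As before, it assumes a HotSpot JVM, where new long[Integer.MAX_VALUE] throws an OutOfMemoryError.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LostSignalDemo {

    /**
     * Thread A (the caller) waits for a signal from thread B.
     * Returns true if the signal arrived within the timeout.
     */
    static boolean waitForProducer(boolean producerFails, long timeoutMillis) {
        CountDownLatch ready = new CountDownLatch(1);
        Thread producer = new Thread(() -> {
            if (producerFails) {
                // Assumption: on HotSpot this allocation always fails with an OutOfMemoryError.
                long[] huge = new long[Integer.MAX_VALUE];
                System.out.println(huge.length); // not reached
            }
            ready.countDown(); // skipped entirely when the OOME kills the thread
        }, "thread-B");
        producer.setUncaughtExceptionHandler(
                (t, e) -> System.err.println(t.getName() + " died: " + e));
        producer.start();
        try {
            return ready.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy producer signalled: " + waitForProducer(false, 5000));
        System.out.println("failed producer signalled:  " + waitForProducer(true, 300));
    }
}
```

A timeout only lets thread A notice that something went wrong; it does not repair whatever thread B left half-done, which is why shutting the application down is often the safer reaction.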
