Project

General

Profile

Actions

User story #5617

closed

Detecting and restarting Rudder on OOM (Out Of Memory Exception)

Added by Lionel Le Folgoc over 9 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
N/A
Assignee:
-
Category:
System integration
Target version:
UX impact:
Suggestion strength:
User visibility:
Effort required:
Name check:
Fix check:
Regression:

Description

Hi,

I've a lot of nodes (5000 "test" nodes). The policy generation following their acceptation has been ongoing since yesterday 21:28 (for 12 hours now). This is not normal, it should not have been longer than 4 hours.

In the webapp log, I can see the following exception:

[2014-10-07 21:26:35] INFO  com.normation.rudder.services.policies.DeploymentServiceImpl - Start policy generation, checking updated rules
[2014-10-07 21:28:28] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 74577 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.
[2014-10-07 21:28:29] ERROR net.liftweb.actor.ActorLogger - Actor threw an exception
java.lang.OutOfMemoryError: Java heap space
        at com.unboundid.util.StaticUtils.toLowerCase(StaticUtils.java:440) ~[unboundid-ldapsdk-2.3.4.jar:2.3.4]
Exception in thread "pool-3-thread-8" java.lang.OutOfMemoryError: Java heap space
        at com.unboundid.util.StaticUtils.toLowerCase(StaticUtils.java:440)
        at com.unboundid.ldap.sdk.Entry.<init>(Entry.java:309)
        at com.unboundid.ldap.sdk.Entry.<init>(Entry.java:284)
        at com.normation.ldap.sdk.LDAPEntry$.apply(LDAPEntry.scala:291)
        at com.normation.ldap.sdk.LDAPEntry$.apply(LDAPEntry.scala:293)
        at com.normation.ldap.sdk.RoLDAPConnection$$anonfun$search$1.apply(LDAPConnection.scala:303)
        at com.normation.ldap.sdk.RoLDAPConnection$$anonfun$search$1.apply(LDAPConnection.scala:303)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at com.normation.ldap.sdk.RoLDAPConnection.search(LDAPConnection.scala:303)
        at com.normation.ldap.sdk.ReadOnlyEntryLDAPConnection$class.search(LDAPConnection.scala:82)
        at com.normation.ldap.sdk.RoLDAPConnection.search(LDAPConnection.scala:283)
        at com.normation.ldap.sdk.ReadOnlyEntryLDAPConnection$class.searchOne(LDAPConnection.scala:142)
        at com.normation.ldap.sdk.RoLDAPConnection.searchOne(LDAPConnection.scala:283)
        at com.normation.rudder.services.nodes.NodeInfoServiceImpl$$anonfun$getAll$1.apply(NodeInfoService.scala:193)
        at com.normation.rudder.services.nodes.NodeInfoServiceImpl$$anonfun$getAll$1.apply(NodeInfoService.scala:189)
        at com.normation.ldap.sdk.LDAPConnectionProvider$$anonfun$map$1.apply(LDAPConnectionProvider.scala:94)
        at com.normation.ldap.sdk.LDAPConnectionProvider$$anonfun$map$1.apply(LDAPConnectionProvider.scala:93)
        at com.normation.ldap.sdk.LDAPConnectionProvider$class.withCon(LDAPConnectionProvider.scala:154)
        at com.normation.ldap.sdk.ROPooledSimpleAuthConnectionProvider.withCon(LDAPConnectionProvider.scala:369)
        at com.normation.ldap.sdk.LDAPConnectionProvider$class.map(LDAPConnectionProvider.scala:93)
        at com.normation.ldap.sdk.ROPooledSimpleAuthConnectionProvider.map(LDAPConnectionProvider.scala:369)
        at com.normation.rudder.services.nodes.NodeInfoServiceImpl.getAll(NodeInfoService.scala:189)
        at com.normation.rudder.services.policies.DeploymentService_findDependantRules_bruteForce$class.getAllNodeInfos(DeploymentService.scala:322)
        at com.normation.rudder.services.policies.DeploymentServiceImpl.getAllNodeInfos(DeploymentService.scala:276)
        at com.normation.rudder.services.policies.DeploymentService$$anonfun$2.apply(DeploymentService.scala:90)
[2014-10-07 21:28:30] INFO  com.normation.rudder.batch.AsyncDeploymentAgent - One automatic policy update process is already pending, ignoring new policy update request

Looks like it actually failed two minutes after the beginning, but rudder is stuck thinking it's still ongoing:

Updating policies (started at 2014-10-07 21:26). Another update is pending since 2014-10-07 21:26

Thanks.


Related issues 5 (0 open5 closed)

Related to Rudder - Bug #2843: Rudder can fail to generate promises when Java is lacking memoryRejectedActions
Related to Rudder - Bug #7524: Java OOM during Java's log migrationReleasedNicolas CHARLES2015-12-01Actions
Related to Rudder - Bug #7735: OOM in Rudder when there are too many repaired reportsReleasedFrançois ARMANDActions
Related to Rudder - Architecture #8923: Requires Java8 (jdk8) for Rudder 4.0RejectedJonathan CLARKE2016-09-07Actions
Related to Rudder - Bug #8165: rudder-init fails to report memory errors from jetty startReleasedBenoît PECCATTEActions
Actions #1

Updated by François ARMAND over 9 years ago

Ah, the kind of things making you love java....

From an user experience point of view, I think it be better to kill the JVM on such scenario, so that at least we don't have to deal with a zombie app (and so, our agent can detect it's no more running and start it again).

Of course, it would be even better to not have such error, I'm talking about a remission scenario here.

Actions #2

Updated by François ARMAND over 9 years ago

  • Category set to System integration
  • Status changed from New to Discussion
  • Assignee set to François ARMAND
  • Target version set to 2.6.19

I propose to add a JVM option to kill the server and restart it when out of memory errors happen

Actions #3

Updated by François ARMAND over 9 years ago

  • Target version changed from 2.6.19 to 140

More information about what we want to do:

Use the "OnOutOfMemoryError" in something like: -XX:OnOutOfMemoryError="kill -9 %p ; RESTART_RUDDER"

Example of use:

    % cat boom.scala
    object Booom {
      def main(args: Array[String]) : Unit = {
        val a = Array.fill(1000000)(Array.fill(100000)(1))
      }
    }

    % scala boom.scala
    java.lang.OutOfMemoryError: Java heap space
            at scala.collection.mutable.ArrayBuilder$ofInt.mkArray(ArrayBuilder.scala:323)
            at scala.collection.mutable.ArrayBuilder$ofInt.resize(ArrayBuilder.scala:329)
            .....

    % scala -J-XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'"  boom.scala
    #
    # java.lang.OutOfMemoryError: Java heap space
    # -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" 
    #   Executing /bin/sh -c "kill -9 10251 && echo 'such memory, much killness, wow'"...
    such memory, much killness, wow
    /home/fanf/java/scala/bin/scala: line 21: 10251 Killed                  "$@" 

There is one big caveat to that: the command(s) passed to the option use fork() on linux, meaning that they need as much memory as the parent JVM for the fork to succeed, modulo (on Linux) the memory_overcommit profil used on the server. Of course, most of the time, if you get a OutOfMemory Error, that means that you don't have any more memory available for the JVM, and if your JVM is configured to use most of the server memory available, the fork won't be possible under default overcommit configuration on Linux (mode 0, 50% allowed).

Workaround: configure overcommit in mode 1 or with 100%.

See:
- https://bugs.openjdk.java.net/browse/JDK-8027434 for the bug report on that problem.
- https://www.kernel.org/doc/Documentation/vm/overcommit-accounting for explanation of overcommit modes
- https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-captun.html for example about how to configure overcommit on RedHat (that boils donw to: "echo 1 > /proc/sys/vm/overcommit_memory" for temp configuration, use sysctl config for permanent one)

Actions #4

Updated by Matthieu CERDA over 9 years ago

  • Target version changed from 140 to 3.0.0~beta1
Actions #5

Updated by Jonathan CLARKE over 9 years ago

  • Target version changed from 3.0.0~beta1 to 3.0.0~beta2
Actions #6

Updated by François ARMAND over 9 years ago

  • Target version changed from 3.0.0~beta2 to 3.0.0~rc1
Actions #7

Updated by Vincent MEMBRÉ about 9 years ago

  • Target version changed from 3.0.0~rc1 to 3.0.0
Actions #8

Updated by François ARMAND about 9 years ago

  • Target version changed from 3.0.0 to 2.10.9

I have no idea of what we could do for now. All in all, the target is not correct: the bug exists since java exists (or something alike).

Perhaps we should add a system directive looking for occurence of "java.lang.OutOfMemoryError: Java heap space" (or other oom errors), and restarting Rudder in that case ?

Actions #9

Updated by Vincent MEMBRÉ about 9 years ago

  • Target version changed from 2.10.9 to 2.10.10
Actions #10

Updated by Vincent MEMBRÉ about 9 years ago

  • Target version changed from 2.10.10 to 2.10.11
Actions #11

Updated by Vincent MEMBRÉ about 9 years ago

  • Target version changed from 2.10.11 to 2.10.12
Actions #12

Updated by Vincent MEMBRÉ about 9 years ago

  • Target version changed from 2.10.12 to 2.10.13
Actions #13

Updated by Vincent MEMBRÉ almost 9 years ago

  • Target version changed from 2.10.13 to 2.10.14
Actions #14

Updated by Vincent MEMBRÉ almost 9 years ago

  • Target version changed from 2.10.14 to 2.10.15
Actions #15

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 2.10.15 to 2.10.16
Actions #16

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 2.10.16 to 2.10.17
Actions #17

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 2.10.17 to 2.10.18
Actions #18

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 2.10.18 to 2.10.19
Actions #19

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 2.10.19 to 2.10.20
Actions #20

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 2.10.20 to 277
Actions #21

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 277 to 2.11.18
Actions #22

Updated by Vincent MEMBRÉ about 8 years ago

  • Target version changed from 2.11.18 to 2.11.19
Actions #23

Updated by Vincent MEMBRÉ about 8 years ago

  • Target version changed from 2.11.19 to 2.11.20
Actions #24

Updated by Vincent MEMBRÉ almost 8 years ago

  • Target version changed from 2.11.20 to 2.11.21
Actions #25

Updated by François ARMAND almost 8 years ago

  • Subject changed from Aborted policy generation due to out of memory exception is not detected to Detecting and restarting Rudder on OOM (Out Of Memory Exception)
Actions #27

Updated by François ARMAND almost 8 years ago

  • Related to Bug #2843: Rudder can fail to generate promises when Java is lacking memory added
Actions #28

Updated by François ARMAND almost 8 years ago

  • Related to Bug #7524: Java OOM during Java's log migration added
Actions #29

Updated by François ARMAND almost 8 years ago

  • Related to Bug #7735: OOM in Rudder when there are too many repaired reports added
Actions #30

Updated by François ARMAND almost 8 years ago

We could also make the OOM detection be done in a cron, looking in the last log, and restarting Rudder when Needed.

Actions #31

Updated by François ARMAND almost 8 years ago

To better explain why the -XX:OnOutOfMemoryError is not a very good solution:

On a VM with 2,5G of RAM, default Centos6, no space, where Oom is a java program that is just allocating big arrays:

At 1G heap space, everything works as expected:

root@test:~/testOOM# java6 -Xms1000m -Xmx1000m -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" Oom
start
assigning array num. 0
assigning array num. 1000
assigning array num. 2000
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" 
#   Executing /bin/sh -c "kill -9 3613 && echo 'such memory, much killness, wow'"...
such memory, much killness, wow
Killed

At 1,3Go heap space, the forked children get killed before having a chance to kick off, NOT doing it's job:

root@test:~/testOOM# java6 -Xms1300m -Xmx1300m -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" Oom
start
assigning array num. 0
assigning array num. 1000
assigning array num. 2000
assigning array num. 3000
assigning array num. 4000
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" 
#   Executing /bin/sh -c "kill -9 3601 && echo 'such memory, much killness, wow'"...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at Oom.main(Oom.java:18)

(actually, it's even befoare 1,3Go, because heap space is just a little party of the java RAM process, and Linux has some reserved memory, and with Xms we ask for a preallocation of the whole memory segment. But you get the basic idea: around 50% of RAM allocated to the heap, -XX:OnOutOfMemoryError stop being usefull)

And at higher level of RAM, linux is killing directly the process, not letting OOM appear (this case is out of scope, it's for information):

root@test:~/testOOM# java6 -Xms2300m -Xmx2300m -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" Oom
start
assigning array num. 0
assigning array num. 1000
assigning array num. 2000
assigning array num. 3000
assigning array num. 4000
Killed
Actions #32

Updated by François ARMAND almost 8 years ago

  • Assignee changed from François ARMAND to Benoît PECCATTE

Benoit, could you please find what it would take to have a cron script (or promise) looking in the log and checking for OOM and restarting rudder if so ?

The main difficulities is to be able to be sure that we don't kill Rudder in loop. That should be ok, because Jetty is creating a new log file at each start of Rudder, but perhaps there is other things to take care of?

Actions #33

Updated by Florian Heigl almost 8 years ago

One thing to take care of is the presence of the various heap and crash dumps the JVM will litter around, one set per OOM.
Those can be pretty troublesome as they fill up the filesystem.
In our case we got really lucky in that we had already separated the postgresql and ldap directories.
Others may not and might face OOM-related corruption on the other components.

Actions #34

Updated by François ARMAND almost 8 years ago

Note: in JVM 8u92, two jvm options were introduced: http://www.oracle.com/technetwork/java/javase/8u92-relnotes-2949471.html

- ExitOnOutOfMemoryError
- CrashOnOutOfMemoryError

It seems to be the behavior that we want.

We should enable one or the other when we detect that the JVM version is sufficiently hight to support them.

Actions #35

Updated by Vincent MEMBRÉ almost 8 years ago

  • Target version changed from 2.11.21 to 2.11.22
Actions #36

Updated by Vincent MEMBRÉ almost 8 years ago

  • Target version changed from 2.11.22 to 2.11.23
Actions #37

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 2.11.23 to 2.11.24
Actions #38

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 2.11.24 to 308
Actions #39

Updated by François ARMAND over 7 years ago

Actions #40

Updated by François ARMAND over 7 years ago

Some more precision on JVM supporting these flags:

- hotspot and openjdk JDK 8u92 (1.8.0_92-b14) and up
- openjdk
- jrockit, any java 7/8
- not on IBM JVM

We have to option, I'm not sure which one we prefer. I think crash, so that we have a chance to know what part of the code OOMed.

Crash example:

% scala -J-XX:+CrashOnOutOfMemoryError

scala> val a = new Array[Array[Int]](10000)
a: Array[Array[Int]] = Array(null, null, null, null, null, ...

scala> for( i <- 0 until 10000) { a(i) = new Array(100000) }
Aborting due to java.lang.OutOfMemoryError: Java heap space
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (debug.cpp:308), pid=2125, tid=0x00007f81e7ffa700
#  fatal error: OutOfMemory encountered: Java heap space
#
# JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
# Core dump written. Default location: /home/fanf/core or core.2125
#
# An error report file with more information is saved as:
# /home/fanf/hs_err_pid2125.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
/home/fanf/java/scala/bin/scala : ligne 21 :  2125 Abandon                 (core dumped)"$@" 

Or simply exit:

== [fanf@luhman16] ~ (master●) ==
% scala -J-XX:+ExitOnOutOfMemoryError
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101).
Type in expressions for evaluation. Or try :help.

scala> val a = new Array[Array[Int]](10000)
a: Array[Array[Int]] = Array(null, null, null, null, null, ...

scala> for( i <- 0 until 10000) { a(i) = new Array(100000) }
Terminating due to java.lang.OutOfMemoryError: Java heap space
Actions #41

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 308 to 3.1.14
Actions #42

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 3.1.14 to 3.1.15
Actions #43

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 3.1.15 to 3.1.16
Actions #44

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 3.1.16 to 3.1.17
Actions #45

Updated by Vincent MEMBRÉ over 7 years ago

  • Target version changed from 3.1.17 to 3.1.18
Actions #46

Updated by Vincent MEMBRÉ about 7 years ago

  • Target version changed from 3.1.18 to 3.1.19
Actions #47

Updated by Benoît PECCATTE about 7 years ago

  • Tracker changed from Bug to User story
Actions #48

Updated by François ARMAND about 7 years ago

  • Related to Bug #8165: rudder-init fails to report memory errors from jetty start added
Actions #49

Updated by Vincent MEMBRÉ almost 7 years ago

  • Target version changed from 3.1.19 to 3.1.20
Actions #50

Updated by Vincent MEMBRÉ almost 7 years ago

  • Target version changed from 3.1.20 to 3.1.21
Actions #51

Updated by Vincent MEMBRÉ almost 7 years ago

  • Target version changed from 3.1.21 to 3.1.22
Actions #52

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 3.1.22 to 3.1.23
Actions #53

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 3.1.23 to 3.1.24
Actions #54

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 3.1.24 to 3.1.25
Actions #55

Updated by Benoît PECCATTE over 6 years ago

  • Target version changed from 3.1.25 to 4.1.9
Actions #56

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.1.9 to 4.1.10
Actions #57

Updated by Vincent MEMBRÉ about 6 years ago

  • Target version changed from 4.1.10 to 4.1.11
Actions #58

Updated by Vincent MEMBRÉ almost 6 years ago

  • Target version changed from 4.1.11 to 4.1.12
Actions #59

Updated by Vincent MEMBRÉ almost 6 years ago

  • Target version changed from 4.1.12 to 4.1.13
Actions #60

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.1.13 to 4.1.14
Actions #61

Updated by Benoît PECCATTE over 5 years ago

  • Target version changed from 4.1.14 to 4.1.15
Actions #62

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.1.15 to 4.1.16
Actions #63

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.1.16 to 4.1.17
Actions #64

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.1.17 to 4.1.18
Actions #65

Updated by Vincent MEMBRÉ about 5 years ago

  • Target version changed from 4.1.18 to 4.1.19
Actions #66

Updated by Alexis Mousset about 5 years ago

  • Target version changed from 4.1.19 to 4.1.20
Actions #67

Updated by Alexis Mousset over 4 years ago

  • Target version changed from 4.1.20 to 588
Actions #68

Updated by Alexis Mousset over 4 years ago

  • Target version changed from 588 to 5.0.13
Actions #69

Updated by Vincent MEMBRÉ over 4 years ago

  • Target version changed from 5.0.13 to 5.0.14
Actions #70

Updated by Vincent MEMBRÉ over 4 years ago

  • Target version changed from 5.0.14 to 5.0.15
Actions #71

Updated by Vincent MEMBRÉ over 4 years ago

  • Target version changed from 5.0.15 to 5.0.16
Actions #72

Updated by Alexis Mousset about 4 years ago

  • Target version changed from 5.0.16 to 5.0.17
Actions #73

Updated by Vincent MEMBRÉ almost 4 years ago

  • Target version changed from 5.0.17 to 5.0.18
Actions #74

Updated by Benoît PECCATTE almost 4 years ago

  • Target version changed from 5.0.18 to 6.2.0~beta1
Actions #75

Updated by François ARMAND almost 4 years ago

  • Status changed from Discussion to Resolved
  • Assignee deleted (Benoît PECCATTE)
  • Target version changed from 6.2.0~beta1 to 5.0.18

We had several iteration on that one since rudder 4.1, and as of rudder 5.0, the solution is pretty robust. It works with OOM and all other "non expected", fatal exceptions.

Actions

Also available in: Atom PDF