Project

General

Profile

Actions

Bug #11594

closed

JVM crash due to thread leak in CachedFindRuleNodeStatusReports

Added by François ARMAND over 6 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
N/A
Assignee:
-
Category:
Performance and scalability
Target version:
Severity:
Major - prevents use of part of Rudder | no simple workaround
UX impact:
User visibility:
Operational - other Techniques | Rudder settings | Plugins
Effort required:
Medium
Priority:
33
Name check:
Fix check:
Regression:

Description

We have a thread leak, perhaps due to a deadlock+unbounded thread pool. Some user are reported crashes with javacore filled with thousands of following stack trace:

3XMTHREADINFO      "ForkJoinPool-1-worker-46345" J9VMThread:0x0000000035564D00, j9thread_t:0x00007F1019500010, java/lang/Thread:0x00000003EE1D81C0, state:B, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1C6C3, isDaemon:true)
3XMTHREADINFO1            (native thread ID:0x4DD4, native priority:0x5, native policy:UNKNOWN, vmstate:B, vm thread flags:0x01000201)
3XMTHREADINFO2            (native stack address range from:0x00007F1018FA0000, to:0x00007F1018FE1000, size:0x41000)
3XMCPUTIME               CPU usage total: 0.332868435 secs, current category="Application" 
3XMTHREADBLOCK     Blocked on: com/normation/rudder/services/reports/CachedReportingServiceImpl@0x00000003D9B602D8 Owned by: "ForkJoinPool-1-worker-47359" (J9VMThread:0x000000003D862600, java/lang/Thread:0x0000000707287D28)
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedFindRuleNodeStatusReports$$anonfun$invalidate$1.apply(ReportingServiceImpl.scala:152(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedFindRuleNodeStatusReports$$anonfun$invalidate$1.apply(ReportingServiceImpl.scala:152(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/ExecutionContextImpl$DefaultThreadFactory$$anon$2$$anon$4.block(ExecutionContextImpl.scala:48(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinPool.managedBlock(ForkJoinPool.java:3640(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45(Compiled Code))
4XESTACKTRACE                at scala/concurrent/package$.blocking(package.scala:123(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedFindRuleNodeStatusReports$class.invalidate(ReportingServiceImpl.scala:152(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedReportingServiceImpl.invalidate(ReportingServiceImpl.scala:76(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/reports/execution/ReportsExecutionService$$anonfun$hook$2.apply$mcV$sp(ReportsExecutionService.scala:209(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/reports/execution/ReportsExecutionService$$anonfun$hook$2.apply(ReportsExecutionService.scala:209(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/reports/execution/ReportsExecutionService$$anonfun$hook$2.apply(ReportsExecutionService.scala:209(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/Future$PromiseCompletingRunnable.run(Future.scala:24(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinTask.doExec(ForkJoinTask.java:260(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinPool.runWorker(ForkJoinPool.java:1979(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107(Compiled Code))
3XMTHREADINFO3           No native callstack available for this thread

Related issues 1 (0 open1 closed)

Related to Rudder - Bug #14465: Generation fails with "Task FutureTask rejected from ThreadPoolExecutor" due to timeout in JS computationReleasedFrançois ARMANDActions
Actions #1

Updated by François ARMAND over 6 years ago

The problem is there:

  /**
   * Invalidate some keys in the cache. That won't charge them again
   * immediately
   *
   * Add a "blocking" signal the Future's thread pool to give more thread to other
   * because this one is taken for a long time.
   */
  def invalidate(nodeIds: Set[NodeId]): Box[Map[NodeId, (RunAndConfigInfo, Set[RuleNodeStatusReport])]] = scala.concurrent.blocking { this.synchronized {
    logger.debug(s"Compliance cache: invalidate cache for nodes: [${nodeIds.map { _.value }.mkString(",")}]")
    cache = cache -- nodeIds
    //preload new results
    checkAndUpdateCache(nodeIds)
  } }

Most likelly, for some reason we enter in a state where there is a lot of update done, and suddenly we have thread pilling up too fast to unstack. If the modification rate is sufficient to require a full new checkAndUpdateCache, we reach thread apocalypse at some point.

Actions #2

Updated by François ARMAND over 6 years ago

As it was reported during night (only ?), we can thing it happens when a loads of servers are sending there inventory. And for some reason, updating the cache take a while, perhaps because some blocking semaphor below, and we never get out of the maze.

Most likelly, we need to rewrite a the node repository to better manage LDAP thread r/w connection and remove blocking logic.

Actions #4

Updated by François ARMAND over 6 years ago

@nicolas: I don't think so, the blocking seems to be above. But well, who knows exactly ;)

Actions #5

Updated by Janos Matya over 6 years ago

For long term it might be a good idea to outsource building the policy, either to the relays or dedicated host, like a distcc setup (would most probably require a major rewrite, but could solve quite a some scalability issues you might also face in the future)

Actions #6

Updated by Benoît PECCATTE over 6 years ago

  • Effort required set to Medium
  • Priority changed from 52 to 35
Actions #7

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 3.1.24 to 3.1.25
Actions #8

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 3.1.25 to 387
  • Priority changed from 35 to 34
Actions #9

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 387 to 4.1.10
Actions #10

Updated by Vincent MEMBRÉ about 6 years ago

  • Target version changed from 4.1.10 to 4.1.11
  • Priority changed from 34 to 33
Actions #11

Updated by Vincent MEMBRÉ about 6 years ago

  • Target version changed from 4.1.11 to 4.1.12
  • Priority changed from 33 to 32
Actions #12

Updated by Vincent MEMBRÉ almost 6 years ago

  • Target version changed from 4.1.12 to 4.1.13
  • Priority changed from 32 to 31
Actions #13

Updated by Benoît PECCATTE almost 6 years ago

  • Target version changed from 4.1.13 to 411
Actions #14

Updated by Benoît PECCATTE almost 6 years ago

  • Target version changed from 411 to 4.1.13
Actions #15

Updated by Vincent MEMBRÉ almost 6 years ago

  • Target version changed from 4.1.13 to 4.1.14
Actions #16

Updated by Benoît PECCATTE over 5 years ago

  • Target version changed from 4.1.14 to 4.1.15
  • Priority changed from 31 to 30
Actions #17

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.1.15 to 4.1.16
  • Priority changed from 30 to 29
Actions #18

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.1.16 to 4.1.17
Actions #19

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.1.17 to 4.1.18
  • Priority changed from 29 to 0
Actions #20

Updated by Vincent MEMBRÉ over 5 years ago

  • Target version changed from 4.1.18 to 4.1.19
Actions #21

Updated by Alexis Mousset about 5 years ago

  • Target version changed from 4.1.19 to 4.1.20
Actions #22

Updated by François ARMAND about 5 years ago

  • Target version changed from 4.1.20 to 4.1.21
Actions #23

Updated by Vincent MEMBRÉ about 5 years ago

  • Target version changed from 4.1.21 to 4.1.22
Actions #24

Updated by Nicolas CHARLES about 5 years ago

  • Related to Bug #14465: Generation fails with "Task FutureTask rejected from ThreadPoolExecutor" due to timeout in JS computation added
Actions #25

Updated by François ARMAND about 5 years ago

  • User visibility changed from Operational - other Techniques | Technique editor | Rudder settings to Operational - other Techniques | Rudder settings | Plugins
  • Priority changed from 0 to 29
Actions #26

Updated by Vincent MEMBRÉ almost 5 years ago

  • Target version changed from 4.1.22 to 4.1.23
Actions #27

Updated by Vincent MEMBRÉ almost 5 years ago

  • Target version changed from 4.1.23 to 4.1.24
Actions #28

Updated by Vincent MEMBRÉ almost 5 years ago

  • Target version changed from 4.1.24 to 588
Actions #29

Updated by Benoît PECCATTE almost 5 years ago

  • Target version changed from 588 to 5.0.13

Updated to 5.0 but I need a second thought, @François ?

Actions #30

Updated by Vincent MEMBRÉ over 4 years ago

  • Target version changed from 5.0.13 to 5.0.14
Actions #31

Updated by Vincent MEMBRÉ over 4 years ago

  • Target version changed from 5.0.14 to 5.0.15
Actions #32

Updated by Vincent MEMBRÉ over 4 years ago

  • Target version changed from 5.0.15 to 5.0.16
Actions #33

Updated by Alexis Mousset about 4 years ago

  • Target version changed from 5.0.16 to 5.0.17
Actions #34

Updated by Vincent MEMBRÉ about 4 years ago

  • Target version changed from 5.0.17 to 5.0.18
Actions #35

Updated by Vincent MEMBRÉ over 3 years ago

  • Target version changed from 5.0.18 to 5.0.19
Actions #36

Updated by Vincent MEMBRÉ over 3 years ago

  • Target version changed from 5.0.19 to 5.0.20
Actions #37

Updated by Vincent MEMBRÉ over 3 years ago

  • Target version changed from 5.0.20 to 797
Actions #38

Updated by Benoît PECCATTE almost 3 years ago

  • Target version changed from 797 to 6.1.14
  • Priority changed from 29 to 31
Actions #39

Updated by Vincent MEMBRÉ almost 3 years ago

  • Target version changed from 6.1.14 to 6.1.15
Actions #40

Updated by Vincent MEMBRÉ almost 3 years ago

  • Target version changed from 6.1.15 to 6.1.16
Actions #41

Updated by Vincent MEMBRÉ over 2 years ago

  • Target version changed from 6.1.16 to 6.1.17
  • Priority changed from 31 to 32
Actions #42

Updated by Vincent MEMBRÉ over 2 years ago

  • Target version changed from 6.1.17 to 6.1.18
Actions #43

Updated by Vincent MEMBRÉ over 2 years ago

  • Target version changed from 6.1.18 to 6.1.19
  • Priority changed from 32 to 33
Actions #44

Updated by François ARMAND about 2 years ago

  • Status changed from New to Resolved

This is corrected with the switch to ZIO in rudder 6.1.

Actions

Also available in: Atom PDF