Project

General

Profile

Bug #10456

Several core features don't work anymore with more than 1000 nodes

Added by Janos Mattyasovszky 6 months ago. Updated 6 months ago.

Status:
Released
Priority:
N/A
Category:
Web - Compliance & node report
Target version:
Target version (plugin):
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
User visibility:
Operational - other Techniques | Technique editor | Rudder settings
Effort required:
Priority:
60

Description

I somehow managed to produce this by adding 2287 nodes to a 4.1-rc1, and this happened after the jetty-restart:

[2017-03-17 15:09:26] INFO  migration - Migration of inline variables in Directives to 'rudder' namespace already done, skipping
[2017-03-17 15:09:26] INFO  bootchecks - Check for force reload of Techniques library
[2017-03-17 15:09:26] INFO  bootchecks - Flag file '/opt/rudder/etc/force_technique_reload' does not exist, do not Technique library will not be reloaded
[2017-03-17 15:09:26] INFO  bootchecks - Check that system groups 'hasPolicyServer-*' are dynamics and correct them
[2017-03-17 15:09:26] INFO  bootchecks - Resume policy update if it was running before shutdown
[2017-03-17 15:09:26] INFO  bootchecks - Flag file '/opt/rudder/etc/policy-update-running' found, Start a new policy update now
[2017-03-17 15:09:26] INFO  bootstrap.liftweb.Boot - classpath
[2017-03-17 15:09:26] INFO  com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Start policy generation, checking updated rules
[2017-03-17 15:09:26] INFO  application - Application Rudder started
[2017-03-17 15:09:26] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Pre-policy-generation scripts hooks ran in 20 ms
[2017-03-17 15:09:26] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Pre-policy-generation modules hooks in 2 ms, start getting all generation related data.
2017-03-17 15:09:26.455:INFO:oejs.AbstractConnector:Started SelectChannelConnector@127.0.0.1:8080
[2017-03-17 15:09:26] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - All relevant information fetched in 313 ms, start names historization.
[2017-03-17 15:09:30] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Node contexts built in 4187 ms, start to build new node configurations.
[2017-03-17 15:09:31] ERROR historization - Error with update nodes historization process <- Could not update the nodes historization information in base. <- null
[2017-03-17 15:09:31] ERROR historization - Root cause was: null
[2017-03-17 15:09:31] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Policy generation completed in 4808 ms
[2017-03-17 15:09:31] ERROR com.normation.rudder.batch.AsyncDeploymentAgent$DeployerAgent - Error when updating policy, reason Could not update the nodes historization information in base. <- null
[2017-03-17 15:09:31] ERROR com.normation.rudder.batch.AsyncDeploymentAgent - Policy update error for process '10' at 2017-03-17 15:09:31: Could not update the nodes historization information in base.
Exception in thread "pool-6-thread-15" java.lang.StackOverflowError
        at doobie.util.atom$Atom$$anon$2.$anonfun$set$2(atom.scala:47)
        at doobie.util.atom$Atom$$anon$2.$anonfun$set$2$adapted(atom.scala:47)
        at doobie.util.composite$LowerPriorityComposite$$anon$6.$anonfun$set$4(composite.scala:109)
        at doobie.util.composite$LowerPriorityComposite$$anon$6.$anonfun$set$4$adapted(composite.scala:109)
        at doobie.util.composite$Composite$$anon$2.$anonfun$set$2(composite.scala:56)
        at doobie.util.composite$Composite$$anon$2.$anonfun$set$2$adapted(composite.scala:56)
        at doobie.util.composite$Composite$$anon$2.$anonfun$set$2(composite.scala:56)
        at doobie.util.composite$Composite$$anon$2.$anonfun$set$2$adapted(composite.scala:56)
        at doobie.util.composite$Composite$$anon$2.$anonfun$set$2(composite.scala:56)
        at doobie.util.composite$Composite$$anon$2.$anonfun$set$2$adapted(composite.scala:56)
        at doobie.util.composite$Composite$$anon$2.$anonfun$set$2(composite.scala:56)
        at doobie.util.composite$Composite$$anon$2.$anonfun$set$2$adapted(composite.scala:56)
        at doobie.util.composite$Composite$$anon$2.$anonfun$set$2(composite.scala:56)

  [[[cut for sanity, but this continues down quite a lot]]]


Related issues

Related to Rudder - Bug #11037: Missing agent reports after Rudder server restart Released
Duplicated by Rudder - Bug #10453: Error during policy generation Rejected
Duplicated by Rudder - Bug #10380: Dashboard do not load due to stackoverflow error in 4.1 Rejected

Associated revisions

Revision 916000fe
Added by François ARMAND 6 months ago

Fixes #10456: Several core features don't work anymore with more than 1000 nodes

History

#1 Updated by Alexis MOUSSET 6 months ago

  • Category set to Web - Compliance & node report
  • Target version set to 4.1.0
  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security

#2 Updated by Janos Mattyasovszky 6 months ago

This happend with currently 2287 in the system, I will try to go back to ~1000 nodes and test again.

#3 Updated by François ARMAND 6 months ago

This is most likelly the same bug as #10380, but it is more impacting.

A first, simple workarounnd may be to just increase the available stack space. @Janos, would you mind trying to add in `/etc/default/rudder-jetty` the following arg:

JAVA_OPTIONS="-Xss4M" 

But the correct solution is to check the number of element in the "in" clause, and if more than, say 200, create a temporary table + join.

#4 Updated by François ARMAND 6 months ago

  • Related to Bug #10380: Dashboard do not load due to stackoverflow error in 4.1 added

#5 Updated by Nicolas CHARLES 6 months ago

  • Related to Bug #10453: Error during policy generation added

#6 Updated by François ARMAND 6 months ago

  • Subject changed from Cannot historize names to Stackoverflow during generation for 2200 nodes

And as Nicolas reported, the "historization: null" is most likelly something that is not correctly initialized, and we will be correcting it in #10453

#7 Updated by Janos Mattyasovszky 6 months ago

that did work.

[2017-03-20 13:06:43] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Node configuration written on filesystem in 439848 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Node configuration cached in LDAP in 296 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Node configuration written in 440261 ms, start to update expected reports.
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Reports updated in 866 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Post-policy-generation hooks ran in 74 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Timing summary:
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Run pre-gen scripts hooks :         29 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Run pre-gen modules hooks :          2 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Fetch all information     :        358 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Historize names           :        901 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Build current rule values :         15 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Build target configuration:        766 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Update rule vals          :         65 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Increment rule serials    :         59 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations :     440261 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Save expected reports     :        866 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Run post generation hooks :         74 ms
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Number of nodes updated   :       1087
[2017-03-20 13:06:44] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Policy generation completed in 445502 ms
[2017-03-20 13:06:44] INFO  com.normation.rudder.batch.AsyncDeploymentAgent - Successful policy update '21' [started 2017-03-20 12:59:19 - ended 2017-03-20 13:06:44]

PS: These times were with writing to a ramdisk:

# df -hTP /var/rudder/share
Filesystem     Type   Size  Used Avail Use% Mounted on
none           tmpfs  2.0G  876M  1.2G  43% /var/rudder/share

#8 Updated by François ARMAND 6 months ago

OK, nice, we have a workaround.

Numbers are interesting, to. We are spending just far too much time writting files :/

#9 Updated by François ARMAND 6 months ago

Let merge all the stackoverflow problem under that one.

To sum-up, the problem is in our use of doobie Fragments.in as reported here: https://github.com/tpolecat/doobie/issues/426

#10 Updated by François ARMAND 6 months ago

  • Related to deleted (Bug #10453: Error during policy generation)

#11 Updated by François ARMAND 6 months ago

  • Duplicated by Bug #10453: Error during policy generation added

#12 Updated by François ARMAND 6 months ago

  • Related to deleted (Bug #10380: Dashboard do not load due to stackoverflow error in 4.1)

#13 Updated by François ARMAND 6 months ago

  • Duplicated by Bug #10380: Dashboard do not load due to stackoverflow error in 4.1 added

#14 Updated by François ARMAND 6 months ago

  • Subject changed from Stackoverflow during generation for 2200 nodes to Several core features don't work anymore with more than 1000 nodes
  • User visibility set to Operational - other Techniques | Technique editor | Rudder settings

#15 Updated by François ARMAND 6 months ago

  • Status changed from New to In progress
  • Assignee set to François ARMAND

#16 Updated by François ARMAND 6 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/1605

#17 Updated by Nicolas CHARLES 6 months ago

  • Assignee changed from Nicolas CHARLES to Alexis MOUSSET
  • Pull Request changed from https://github.com/Normation/rudder/pull/1605 to https://github.com/Normation/rudder/pull/1606

#18 Updated by François ARMAND 6 months ago

  • Status changed from Pending technical review to Pending release

#19 Updated by Benoît PECCATTE 6 months ago

  • Priority set to 60

#20 Updated by Benoît PECCATTE 6 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 4.1.0 which was released today.

#21 Updated by Nicolas CHARLES 22 days ago

  • Related to Bug #11037: Missing agent reports after Rudder server restart added

Also available in: Atom PDF