Bug #10457

Hook failed with fork: retry: No child processes

Added by Janos Mattyasovszky 8 months ago. Updated 8 months ago.

Status: Released
Priority: N/A
Category: System integration
Target version:
Target version (plugin):
Severity: Major - prevents use of part of Rudder | no simple workaround
User visibility: Infrequent - complex configurations | third party integrations
Effort required:
Priority: 25

Description

I got an error after I found #10456:

[2017-03-17 15:31:18] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Policy generation completed in 93326 ms
[2017-03-17 15:31:18] ERROR com.normation.rudder.batch.AsyncDeploymentAgent$DeployerAgent - Error when updating policy, reason Cannot write configuration node <- Exit code=1 for hook: '/opt/rudder/etc/hooks.d/policy-generation-node-ready/10-cf-promise-check' with environment variables: [PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin] [NLSPATH:/usr/dt/lib/nls/msg/%L/%N.cat] [OLDPWD:/] [XFILESEARCHPATH:/usr/dt/app-defaults/%L/Dt] [PWD:/opt/rudder/jetty7] [SHLVL:1] [_:/usr/bin/java] [RUDDER_GENERATION_DATETIME:2017-03-17T15:29:45.296+01:00] [RUDDER_NODEID:61053f9f-b3de-4290-9eda-bc4fe1567233] [RUDDER_NEXT_POLICIES_DIRECTORY:/var/rudder/share/61053f9f-b3de-4290-9eda-bc4fe1567233/rules.new/cfengine-community] [RUDDER_AGENT_TYPE:cfengine-community].
  Stdout: '   error: Can't stat file '/var/rudder/ncf//var/rudder/ncf/common/10_ncf_internals/list-compatible-inputs: fork: retry: No child processes' for parsing. (stat: No such file or directory)
'
  Stderr: ''
[2017-03-17 15:31:18] ERROR com.normation.rudder.batch.AsyncDeploymentAgent - Policy update error for process '12' at 2017-03-17 15:31:18: Cannot write configuration node

Could this be a resource limit (such as nofiles) that prevents the process from forking?

Associated revisions

Revision bef6684f
Added by François ARMAND 8 months ago

Fixes #10457: Hook failed with fork: retry: No child processes

History

#1 Updated by Janos Mattyasovszky 8 months ago

hah, found it:

[ 7219.731466] cgroup: fork rejected by pids controller in /system.slice/rudder-jetty.service
[12893.159767] cgroup: fork rejected by pids controller in /system.slice/rudder-jetty.service
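
The kernel messages above come from the cgroup pids controller, which rejects a fork once the task count of the service reaches its limit. A minimal sketch for checking how close a service is to that limit follows. It assumes the controller exposes the usual `pids.current` and `pids.max` files in the service's cgroup directory. The directory shown in the comment is an assumption and depends on the distribution and on whether cgroup v1 or v2 is in use. The function names are hypothetical.

```python
from pathlib import Path


def parse_pids_max(text):
    """Parse the content of a pids.max file; None means no limit ("max")."""
    value = text.strip()
    return None if value == "max" else int(value)


def headroom(current, maximum):
    """Number of additional tasks allowed before forks are rejected."""
    return None if maximum is None else maximum - current


def read_pids_usage(cgroup_dir):
    """Return (current, maximum) task counts for one cgroup directory."""
    base = Path(cgroup_dir)
    current = int((base / "pids.current").read_text().strip())
    maximum = parse_pids_max((base / "pids.max").read_text())
    return current, maximum


# Example call (assumed cgroup v1 path, adjust for your system):
# read_pids_usage("/sys/fs/cgroup/pids/system.slice/rudder-jetty.service")
```

A headroom close to zero during policy generation would be consistent with the "fork rejected" messages.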

#2 Updated by Janos Mattyasovszky 8 months ago

According to https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#TasksMax=N, the fix would be to include this line in the unit file (which currently is auto-generated):

TasksMax=infinity

will test this.

#3 Updated by Janos Mattyasovszky 8 months ago

Copied the auto-generated unit file to /etc/systemd/system and added the missing line:

sles12# systemctl cat rudder-jetty
# /etc/systemd/system/rudder-jetty.service
[Unit]
SourcePath=/etc/init.d/rudder-jetty
After=remote-fs.target network-online.target
Wants=remote-fs.target network-online.target

[Service]
Type=forking
TasksMax=infinity  <== Added this
Restart=no
TimeoutSec=5min
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
ExecStart=/etc/init.d/rudder-jetty start
ExecStop=/etc/init.d/rudder-jetty stop

#4 Updated by Alexis MOUSSET 8 months ago

  • Category set to System integration
  • Target version set to 4.1.0
  • Severity set to Major - prevents use of part of Rudder | no simple workaround

#5 Updated by François ARMAND 8 months ago

Well, perhaps it's better if we cap the number of parallel hooks to, say, 50? (Or "number of CPUs + 1", or a configurable parameter.) That won't change the throughput, but it will certainly stress the system less and avoid these limits.

#6 Updated by Janos Mattyasovszky 8 months ago

I'd be happy with nproc --all. The only problem is: what if I scale my VM up during operation and give it more cores? Would I have to restart jetty then? Could this maybe be checked each time a policy generation is started?

#7 Updated by François ARMAND 8 months ago

Oh yes, the thread pool and manager logic is created each time. But I will make sure of that, thanks for pointing out that use case.
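
The idea discussed in the last three comments (cap the number of hooks running at once at "number of CPUs + 1", and re-read the CPU count at each generation) can be sketched as follows. This is an illustration in Python under stated assumptions, not the code from the actual fix. The names `hook_parallelism`, `run_hook` and `run_hooks` are hypothetical.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def hook_parallelism(cpu_count=None):
    """CPU count + 1. Called per generation, so added cores are picked up."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return cpus + 1


def run_hook(argv):
    """Run one hook command (a list of argv strings) and return its exit code."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.returncode


def run_hooks(commands, runner=run_hook, max_workers=None):
    """Run all hook commands with at most max_workers running at the same time.

    The pool is created inside the call, so its size is recomputed for every
    generation instead of once at service start. Results keep the input order.
    """
    workers = max_workers if max_workers is not None else hook_parallelism()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))
```

Because the pool bounds how many child processes exist at once, the number of forks in flight no longer grows with the number of nodes.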

#8 Updated by François ARMAND 8 months ago

  • User visibility set to Infrequent - complex configurations | third party integrations

#9 Updated by François ARMAND 8 months ago

  • Status changed from New to In progress
  • Assignee set to François ARMAND

#10 Updated by François ARMAND 8 months ago

OK, so when using a real task manager, I get more consistent results, around 10% better. But performance is hard to measure, etc.

Before:

Write node configurations :      91750 ms
...
Write node configurations :      85166 ms
...
Write node configurations :      95879 ms

After:

Write node configurations :      79947 ms
...
Write node configurations :      79191 ms
...
Write node configurations :      75608 ms

See pull requests for details.
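
For reference, the relative gain can be computed from the six timings quoted in this comment. This is only a quick arithmetic check on three samples per side, so it should be read as a rough indication rather than a benchmark.

```python
# "Write node configurations" timings quoted in comment #10, in milliseconds.
before_ms = [91750, 85166, 95879]
after_ms = [79947, 79191, 75608]


def mean(values):
    return sum(values) / len(values)


def relative_gain(before, after):
    """Fraction of the mean time saved when going from `before` to `after`."""
    return (mean(before) - mean(after)) / mean(before)


gain = relative_gain(before_ms, after_ms)
print(f"{mean(before_ms):.0f} ms -> {mean(after_ms):.0f} ms ({gain:.1%} less time)")
# prints: 90932 ms -> 78249 ms (13.9% less time)
```

The spread also narrows: the three "before" runs differ by more than 10 seconds, the three "after" runs by under 5 seconds.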

#11 Updated by François ARMAND 8 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/1608

#12 Updated by Nicolas CHARLES 8 months ago

Without the PR, for 1602 nodes:
[2017-03-23 14:37:58] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 55878 ms
[2017-03-23 14:46:10] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 57303 ms
[2017-03-23 14:47:35] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 39640 ms
[2017-03-23 14:48:42] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 33611 ms
[2017-03-23 14:50:04] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 35395 ms

With this PR:
[2017-03-23 15:09:16] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 34423 ms
[2017-03-23 15:10:31] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 40921 ms
[2017-03-23 15:12:14] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 48889 ms
[2017-03-23 15:14:09] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 32235 ms

Note that this was run on a laptop, so the numbers are not really reliable.

#13 Updated by François ARMAND 8 months ago

  • Status changed from Pending technical review to Pending release

#14 Updated by Janos Mattyasovszky 8 months ago

Without this PR, on 32 CPUs and 7000+ nodes:

Dunno, it never finished, and I stopped it after 9+ hours

With this PR (same system):

Sum ~28 minutes (just base policy, no rules/directives).

#15 Updated by Benoît PECCATTE 8 months ago

  • Priority set to 25

#16 Updated by Benoît PECCATTE 8 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 4.1.0 which was released today.
