Bug #10532

On centos7, generation fails because of hook policy-generation-node-ready/10-cf-promise-check

Added by François ARMAND 8 months ago. Updated 7 months ago.

Status: Released
Priority: N/A
Category: Web - Config management
Target version:
Target version (plugin):
Severity:
User visibility:
Effort required:
Priority: 0

Description

The generation error is:

⇨ Policy update error for process '4' at 2017-03-29 13:25:32 
⇨ Cannot write configuration node 
⇨ Exit code=1 for hook: '/opt/rudder/etc/hooks.d/policy-generation-node-ready/10-cf-promise-check' with environment variables: [PATH:/sbin:/usr/sbin:/bin:/usr/bin] [SYSTEMCTL_IGNORE_DEPENDENCIES:] [NLSPATH:/usr/dt/lib/nls/msg/%L/%N.cat] [SYSTEMCTL_SKIP_REDIRECT:] [OLDPWD:/opt/rudder/jetty7] [TERM:vt100] [XFILESEARCHPATH:/usr/dt/app-defaults/%L/Dt] [PWD:/opt/rudder/jetty7] [SHLVL:2] [_:/bin/java] [RUDDER_GENERATION_DATETIME:2017-03-29T13:25:29.110Z] [RUDDER_NODEID:d456d9d9-9d1a-4116-a495-340ed55f6c32] [RUDDER_NEXT_POLICIES_DIRECTORY:/var/rudder/share/d456d9d9-9d1a-4116-a495-340ed55f6c32/rules.new/cfengine-community] [RUDDER_AGENT_TYPE:cfengine-community]. 
Stdout: ' error: Can't stat file '/var/rudder/ncf/find: '/var/rudder/cfengine-community/state/ncf-exclude-cache-3.10.0/_var_rudder_ncf_common_10_ncf_internals': No such file or directory' for parsing. (stat: No such file or directory)
' 
Stderr: ''

Associated revisions

Revision fea371e8
Added by Alexis MOUSSET 8 months ago

Fixes #10532: On centos7, generation fails because of hook policy-generation-node-ready/10-cf-promise-check

History

#1 Updated by François ARMAND 8 months ago

This is transient: manually starting another full regeneration corrected the problem.

#2 Updated by François ARMAND 8 months ago

We have the problem on two different CentOS 7 machines. The failing part is not always the same. Asking for a new regeneration (or a full new generation) clears the problem (most of the time).
Asking for a full regeneration has a high probability (35%?) of leading to a variant of the error.
It is not always the same node, nor the same file, that has the error. Some examples below:


#### two nodes failing, root and one behind a relay:

 Exit code=1 for hook: '/opt/rudder/etc/hooks.d/policy-generation-node-ready/10-cf-promise-check' with environment variables: [PATH:/sbin:/usr/sbin:/bin:/usr/bin] [SYSTEMCTL_IGNORE_DEPENDENCIES:] [NLSPATH:/usr/dt/lib/nls/msg/%L/%N.cat] [SYSTEMCTL_SKIP_REDIRECT:] [OLDPWD:/opt/rudder/jetty7] [TERM:vt100] [XFILESEARCHPATH:/usr/dt/app-defaults/%L/Dt] [PWD:/opt/rudder/jetty7] [SHLVL:2] [_:/bin/java] [RUDDER_GENERATION_DATETIME:2017-03-29T14:16:00.537Z] [RUDDER_NODEID:root] [RUDDER_NEXT_POLICIES_DIRECTORY:/var/rudder/cfengine-community/inputs.new] [RUDDER_AGENT_TYPE:cfengine-community]. 
Stdout: ' error: Can't stat file '/var/rudder/ncf/find: '/var/rudder/cfengine-community/state/ncf-exclude-cache-3.10.0/_var_rudder_ncf_common_30_generic_methods': No such file or directory' for parsing. (stat: No such file or directory)
' 
Stderr: '' ; Exit code=1 for hook: '/opt/rudder/etc/hooks.d/policy-generation-node-ready/10-cf-promise-check' with environment variables: [PATH:/sbin:/usr/sbin:/bin:/usr/bin] [SYSTEMCTL_IGNORE_DEPENDENCIES:] [NLSPATH:/usr/dt/lib/nls/msg/%L/%N.cat] [SYSTEMCTL_SKIP_REDIRECT:] [OLDPWD:/opt/rudder/jetty7] [TERM:vt100] [XFILESEARCHPATH:/usr/dt/app-defaults/%L/Dt] [PWD:/opt/rudder/jetty7] [SHLVL:2] [_:/bin/java] [RUDDER_GENERATION_DATETIME:2017-03-29T14:16:00.537Z] [RUDDER_NODEID:26acd240-5347-4e0c-91e9-d6281341cf2b] [RUDDER_NEXT_POLICIES_DIRECTORY:/var/rudder/share/19a0eaaf-7d56-4251-a697-83942b18df20/share/26acd240-5347-4e0c-91e9-d6281341cf2b/rules.new/cfengine-community] [RUDDER_AGENT_TYPE:cfengine-community]. 
Stdout: ' error: Can't stat file '/var/rudder/ncf/find: '/var/rudder/cfengine-community/state/ncf-exclude-cache-3.10.0/_var_rudder_ncf_common_10_ncf_internals': No such file or directory' for parsing. (stat: No such file or directory)
' 
Stderr: ''

#### another example:

⇨ Policy update error for process '17' at 2017-03-29 14:22:29 
⇨ Cannot write configuration node 
⇨ Exit code=1 for hook: '/opt/rudder/etc/hooks.d/policy-generation-node-ready/10-cf-promise-check' with environment variables: [PATH:/sbin:/usr/sbin:/bin:/usr/bin] [SYSTEMCTL_IGNORE_DEPENDENCIES:] [NLSPATH:/usr/dt/lib/nls/msg/%L/%N.cat] [SYSTEMCTL_SKIP_REDIRECT:] [OLDPWD:/opt/rudder/jetty7] [TERM:vt100] [XFILESEARCHPATH:/usr/dt/app-defaults/%L/Dt] [PWD:/opt/rudder/jetty7] [SHLVL:2] [_:/bin/java] [RUDDER_GENERATION_DATETIME:2017-03-29T14:22:23.982Z] [RUDDER_NODEID:19a0eaaf-7d56-4251-a697-83942b18df20] [RUDDER_NEXT_POLICIES_DIRECTORY:/var/rudder/share/19a0eaaf-7d56-4251-a697-83942b18df20/rules.new/cfengine-community] [RUDDER_AGENT_TYPE:cfengine-community]. 
Stdout: ' error: Can't stat file '/var/rudder/ncf/find: '/var/rudder/cfengine-community/state/ncf-exclude-cache-3.10.0/_var_rudder_ncf_common_60_services': No such file or directory' for parsing. (stat: No such file or directory)
' 
Stderr: ''

#### yet another example:

⇨ Policy update error for process '26' at 2017-03-29 14:27:30 
⇨ Cannot write configuration node 
⇨ Exit code=1 for hook: '/opt/rudder/etc/hooks.d/policy-generation-node-ready/10-cf-promise-check' with environment variables: [PATH:/sbin:/usr/sbin:/bin:/usr/bin] [SYSTEMCTL_IGNORE_DEPENDENCIES:] [NLSPATH:/usr/dt/lib/nls/msg/%L/%N.cat] [SYSTEMCTL_SKIP_REDIRECT:] [OLDPWD:/opt/rudder/jetty7] [TERM:vt100] [XFILESEARCHPATH:/usr/dt/app-defaults/%L/Dt] [PWD:/opt/rudder/jetty7] [SHLVL:2] [_:/bin/java] [RUDDER_GENERATION_DATETIME:2017-03-29T14:27:23.982Z] [RUDDER_NODEID:26acd240-5347-4e0c-91e9-d6281341cf2b] [RUDDER_NEXT_POLICIES_DIRECTORY:/var/rudder/share/19a0eaaf-7d56-4251-a697-83942b18df20/share/26acd240-5347-4e0c-91e9-d6281341cf2b/rules.new/cfengine-community] [RUDDER_AGENT_TYPE:cfengine-community]. 
Stdout: ' error: Can't stat file '/var/rudder/ncf/find: '/var/rudder/cfengine-community/state/ncf-exclude-cache-3.10.0/_var_rudder_ncf_common_20_cfe_basics': No such file or directory' for parsing. (stat: No such file or directory)
' 
Stderr: ''

And so on.

#3 Updated by Alexis MOUSSET 8 months ago

This is a race condition in list-compatible-inputs that seems to only happen on CentOS7:

root@server:/opt/rudder/jetty7# /var/rudder/cfengine-community/bin/cf-promises -f /var/rudder/cfengine-community/inputs.new/promises.cf &
[6] 22088
root@server:/opt/rudder/jetty7# /var/rudder/cfengine-community/bin/cf-promises -f /var/rudder/cfengine-community/inputs.new/promises.cf &
[7] 22645
root@server:/opt/rudder/jetty7#    error: Can't stat file '/var/rudder/ncf/find: '/var/rudder/cfengine-community/state/ncf-exclude-cache-3.10.0/_var_rudder_ncf_common_30_generic_methods': No such file or directory' for parsing. (stat: No such file or directory)

#4 Updated by François ARMAND 8 months ago

This only happens when SELinux is enabled. If we run "setenforce 0", the problem completely disappears.

#5 Updated by François ARMAND 8 months ago

So, it may not be linked to SELinux after all: the problem was just less frequent while testing that track.

The problem may be that the file "/opt/rudder/etc/agent-capabilities" was not viewed as newer than the cache (because for some reason, the cache was modified in the future), and so one of the parallel processes deletes the cache, and the next one tries to do comparisons on deleted files.
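
The "one process deletes the cache while another still reads it" part of this hypothesis can be illustrated with a deterministic interleaving. This is only a sketch in Python, not the actual list-compatible-inputs script: the function names and the temporary directory layout are hypothetical, and the two "instances" are simulated sequentially in one process.

```python
import os
import shutil
import tempfile


def list_cache_entries(cache_dir):
    """Return the full paths of the entries currently in the cache."""
    return [os.path.join(cache_dir, name) for name in sorted(os.listdir(cache_dir))]


def invalidate_cache(cache_dir):
    """Remove the whole cache directory (hypothetical invalidation step)."""
    shutil.rmtree(cache_dir)


def simulate_race():
    with tempfile.TemporaryDirectory() as tmp:
        cache_dir = os.path.join(tmp, "ncf-exclude-cache")
        os.mkdir(cache_dir)
        open(os.path.join(cache_dir, "_var_rudder_ncf_common_entry"), "w").close()

        # Instance A lists the cache entries it intends to inspect.
        entries = list_cache_entries(cache_dir)
        # Instance B decides the cache is stale and removes it.
        invalidate_cache(cache_dir)
        # Instance A now stats a path that no longer exists.
        try:
            os.stat(entries[0])
        except FileNotFoundError:
            return "No such file or directory"
        return "ok"
```

With real parallel processes the interleaving is not guaranteed, which would be consistent with the error hitting a different file or node on each generation.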

#6 Updated by Alexis MOUSSET 8 months ago

  • Assignee set to Alexis MOUSSET
  • Target version changed from 4.1.1 to 4.0.4

What happens here:

  • Our CentOS7 package builder had an EDT (-04) timezone for some reason
  • When installing recently built packages on a machine ahead of EDT, /opt/rudder/etc/agent-capabilities has a modification date in the future.
  • The comparison for compatible files cache invalidation in ncf's list-compatible-inputs is based on the modification date of /opt/rudder/etc/agent-capabilities compared to the cache itself
  • list-compatible-inputs always invalidates the cache and rebuilds a new one
  • When several list-compatible-inputs calls start at the same time (which is what the generation process does when calling cf-promises to check the files), all of them invalidate the cache every time. This leads to race conditions where one of the instances tries to reach a file that has just been removed by another one. Normally this should be very uncommon, and can only happen just after updating the agent or ncf.
  • => generation is randomly broken
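
The invalidation logic described in the list above can be sketched as follows. This is a minimal Python illustration, not the real list-compatible-inputs code: the `cache_is_stale` helper, the file names and the 6-hour offset are assumptions made for the example.

```python
import os
import tempfile
import time


def cache_is_stale(reference_path, cache_path):
    """Hypothetical check: the cache is stale if the reference file is newer."""
    if not os.path.exists(cache_path):
        return True
    return os.path.getmtime(reference_path) > os.path.getmtime(cache_path)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        reference = os.path.join(tmp, "agent-capabilities")
        cache = os.path.join(tmp, "cache")
        for path in (reference, cache):
            open(path, "w").close()
        now = time.time()

        # Normal case: the reference file is older than the cache.
        os.utime(reference, (now - 3600, now - 3600))
        os.utime(cache, (now, now))
        normal = cache_is_stale(reference, cache)

        # Broken case: the reference file has a modification date in the future.
        os.utime(reference, (now + 6 * 3600, now + 6 * 3600))
        # Even a cache rebuilt "right now" is still older than the reference.
        os.utime(cache, (now, now))
        future = cache_is_stale(reference, cache)

        return normal, future
```

Under these assumptions, `demo()` reports the cache as valid in the normal case and as stale in the future-dated case, however recently it was rebuilt. That is why every parallel instance would invalidate it again.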

We will touch the agent-capabilities file in postinstall to avoid this situation in the future. The race condition is not really fixable right now, but policy validation on the nodes instead of the server will fix this issue.
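
For illustration, here is what "touching" the file achieves, sketched in Python rather than in the actual postinstall script (which is not shown here). The `touch` helper, the temporary file and the 6-hour offset are hypothetical.

```python
import os
import tempfile
import time


def touch(path):
    """Set the access and modification times of path to the current time."""
    os.utime(path, None)


def demo_touch():
    with tempfile.TemporaryDirectory() as tmp:
        capabilities = os.path.join(tmp, "agent-capabilities")
        open(capabilities, "w").close()

        # Simulate a packaged file whose modification date is in the future.
        future = time.time() + 6 * 3600
        os.utime(capabilities, (future, future))
        in_future_before = os.path.getmtime(capabilities) > time.time()

        # What the postinstall step would do: reset the date to "now".
        touch(capabilities)
        # One second of tolerance for filesystem timestamp granularity.
        in_future_after = os.path.getmtime(capabilities) > time.time() + 1

        return in_future_before, in_future_after
```

After the touch, the file date is no longer ahead of the local clock, so a cache built afterwards is not older than the file and is not invalidated on every call.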

Note: The packages built far enough in the past have now become good (like wine or cheese), so it is not a blocking issue for 4.1.0.

#7 Updated by Alexis MOUSSET 8 months ago

  • Status changed from New to In progress

#8 Updated by Alexis MOUSSET 8 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Alexis MOUSSET to Benoît PECCATTE
  • Pull Request set to https://github.com/Normation/rudder-packages/pull/1326

#9 Updated by Alexis MOUSSET 8 months ago

  • Status changed from Pending technical review to Pending release

#10 Updated by Vincent MEMBRÉ 7 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 4.0.4 and 4.1.1 which were released today.
