Bug #7338

closed

All reports are missing (totally orange) for a node due to multiple cf-execd processes

Added by Dennis Cabooter over 8 years ago. Updated over 8 years ago.

Status: Released
Priority: 3
Category: Packaging

Description

All reports are missing (totally orange) for a node due to multiple cf-execd processes. The logs are there and visible in the web UI.

Workaround: log in to the node, stop rudder-agent, kill -9 the cf-execd process that is still running, then start rudder-agent.
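The workaround above can be sketched as a small shell function. This is only an illustration: the `run` helper and the `DRY_RUN` variable are hypothetical names introduced here, and using `pgrep -x cf-execd` to find the leftover process is an assumption (the report only says to kill the remaining process by hand). By default nothing is executed; the commands are only printed.

```shell
#!/bin/sh
# Sketch of the manual workaround. DRY_RUN is a hypothetical switch:
# it defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

workaround() {
  run /etc/init.d/rudder-agent stop
  # Any cf-execd still listed after the stop is a leftover instance.
  # Assumption: pgrep is available; -x matches the exact process name.
  for pid in $(pgrep -x cf-execd); do
    run kill -9 "$pid"
  done
  run /etc/init.d/rudder-agent start
}

workaround
```

Running it with `DRY_RUN=0` as root would actually perform the three steps; review the printed commands first.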


Related issues: 1 (0 open, 1 closed)

Related to Rudder - Bug #7189: issues with process management on physical hosting LXC containers (Released, Matthieu CERDA, 2015-09-12)
Actions #1

Updated by Nicolas CHARLES over 8 years ago

Dennis, what happens if you run bash -x /opt/rudder/bin/check-rudder-agent?
What is the exit code?

Actions #2

Updated by Dennis Cabooter over 8 years ago

# ps wwwuax|grep cf-exec|grep -v grep
root      1679  0.0  0.3 107816  3984 ?        Ss   09:34   0:00 /var/rudder/cfengine-community/bin/cf-execd
root      2046  0.0  0.3 107816  3984 ?        Ss   09:34   0:00 /var/rudder/cfengine-community/bin/cf-execd
# bash -x /opt/rudder/bin/check-rudder-agent
+ . /etc/profile
++ '[' '' ']'
++ '[' -d /etc/profile.d ']'
++ for i in '/etc/profile.d/*.sh'
++ '[' -r /etc/profile.d/bash_completion.sh ']'
++ . /etc/profile.d/bash_completion.sh
+++ '[' -n '4.3.11(1)-release' -a -n '' -a -z '' ']'
++ for i in '/etc/profile.d/*.sh'
++ '[' -r /etc/profile.d/rudder-agent.sh ']'
++ . /etc/profile.d/rudder-agent.sh
+++ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/rudder/cfengine-community/bin:/var/rudder/cfengine-community/bin
+++ export PATH
+++ type manpath
++++ manpath
+++ MANPATH=/usr/local/man:/usr/local/share/man:/usr/share/man:/opt/rudder/share/man:/opt/rudder/share/man
+++ export MANPATH
++ unset i
+ set -e
+ export PATH=/opt/rudder/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/rudder/cfengine-community/bin:/var/rudder/cfengine-community/bin
+ PATH=/opt/rudder/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/rudder/cfengine-community/bin:/var/rudder/cfengine-community/bin
+ BACKUP_DIR=/var/backups/rudder/
++ uname -s
+ OS_FAMILY=Linux
+ CFENGINE_DB_EXT=lmdb
+ '[' zLinux = zAIX ']'
+ CP_A='cp -a'
+ CFE_DIR=/var/rudder/cfengine-community
+ CFE_BIN_DIR=/var/rudder/cfengine-community/bin
+ CFE_DISABLE_FILE=/opt/rudder/etc/disable-agent
+ LAST_UPDATE_FILE=/var/rudder/cfengine-community/last_successful_inputs_update
+ UUID_FILE=/opt/rudder/etc/uuid.hive
++ whoami
+ '[' '!' root = root ']'
+ check_and_fix_rudder_uuid
+ LATEST_BACKUPED_UUID=
+ '[' '!' -e /opt/rudder/etc/uuid.hive ']'
++ wc -l
++ grep -E '^[a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12}|root'
++ cat /opt/rudder/etc/uuid.hive
+ CHECK_UUID=1
+ '[' 1 -ne 1 ']'
+ check_and_fix_cfengine_processes
++ ps -h -o utsns --pid 5742
+ ns=4026531838
+ '[' -e /proc/bc/0 ']'
+ '[' -n 4026531838 ']'
+ PS_COMMAND='eval ps --no-header -e -O utsns | grep -E '\''^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'\'''
++ cat
++ grep -E cf-execd
++ grep -v grep
++ eval ps --no-header -e -O utsns '|' grep -E ''\''^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'\'''
+++ grep -E '^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'
+++ ps --no-header -e -O utsns
+ CF_EXECD_RUNNING=' 1679 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd'
++ wc -l
++ grep -v '^$'
++ echo ' 1679 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd'
+ NB_CF_EXECD_RUNNING=2
+ '[' 2 -gt 1 ']'
+ echo_n 'WARNING: Too many instance of CFEngine cf-execd processes running. Killing them...'
+ '[' zLinux = zAIX ']'
+ echo -n WARNING: Too many instance of CFEngine cf-execd processes running. Killing them...
WARNING: Too many instance of CFEngine cf-execd processes running. Killing them...+ xargs kill -9
+ awk 'BEGIN { OFS=" "} {print $2 }'
+ echo ' 1679 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd'
+ true
+ echo ' Done'
 Done
++ cat
++ grep -E '/var/rudder/cfengine-community/bin/(cf-execd|cf-agent)'
++ grep -v grep
++ eval ps --no-header -e -O utsns '|' grep -E ''\''^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'\'''
+++ grep -E '^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'
+++ ps --no-header -e -O utsns
+ CF_PROCESS_RUNNING=' 1679 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd'
++ wc -l
++ grep -v '^$'
++ echo ' 1679 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd'
+ NB_CF_PROCESS_RUNNING=2
+ '[' '!' -e /opt/rudder/etc/disable-agent -a 2 -eq 0 -a -f /var/rudder/cfengine-community/policy_server.dat ']'
+ '[' -f /var/rudder/cfengine-community/inputs/run_interval ']'
++ cat /var/rudder/cfengine-community/inputs/run_interval
+ RUN_INTERVAL=15
++ expr 15 '*' 2
+ CHECK_INTERVAL=30
+ '[' '!' -e /var/rudder/cfengine-community/last_successful_inputs_update -o -e /opt/rudder/etc/disable-agent ']'
++ find /var/rudder/cfengine-community/last_successful_inputs_update -mmin +30
+ test
+ '[' 2 -gt 8 ']'
+ check_and_fix_cf_lock
+ MAX_CF_LOCK_SIZE=10485760
+ '[' -e /var/rudder/cfengine-community/state/cf_lock.lmdb ']'
+ '[' zLinux = zAIX ']'
++ stat -c%s /var/rudder/cfengine-community/state/cf_lock.lmdb
+ CF_LOCK_SIZE=155648
+ '[' 155648 -ge 10485760 ']'
+ '[' zLinux '!=' zAIX ']'
+ check_and_fix_specific_rudder_agent_file /etc/init.d/rudder-agent init
+ FILE_TO_RESTORE=/etc/init.d/rudder-agent
+ FILE_TYPE=init
+ LATEST_BACKUPED_FILES=
+ '[' '!' -e /etc/init.d/rudder-agent ']'
+ check_and_fix_specific_rudder_agent_file /etc/default/rudder-agent default
+ FILE_TO_RESTORE=/etc/default/rudder-agent
+ FILE_TYPE=default
+ LATEST_BACKUPED_FILES=
+ '[' '!' -e /etc/default/rudder-agent ']'
+ check_and_fix_specific_rudder_agent_file /etc/cron.d/rudder-agent cron
+ FILE_TO_RESTORE=/etc/cron.d/rudder-agent
+ FILE_TYPE=cron
+ LATEST_BACKUPED_FILES=
+ '[' '!' -e /etc/cron.d/rudder-agent ']'
+ base=/var/rudder/cfengine-community/inputs
+ empty /var/rudder/cfengine-community/inputs/common/1.0/update.cf
+ '[' '!' -f /var/rudder/cfengine-community/inputs/common/1.0/update.cf ']'
++ awk '{print $1}'
++ du /var/rudder/cfengine-community/inputs/common/1.0/update.cf
+ '[' 20 = 0 ']'
+ empty /var/rudder/cfengine-community/inputs/failsafe.cf
+ '[' '!' -f /var/rudder/cfengine-community/inputs/failsafe.cf ']'
++ awk '{print $1}'
++ du /var/rudder/cfengine-community/inputs/failsafe.cf
+ '[' 8 = 0 ']'
+ empty /var/rudder/cfengine-community/inputs/promises.cf
+ '[' '!' -f /var/rudder/cfengine-community/inputs/promises.cf ']'
++ awk '{print $1}'
++ du /var/rudder/cfengine-community/inputs/promises.cf
+ '[' 36 = 0 ']'
# ps wwwuax|grep cf-exec|grep -v grep
root      1679  0.0  0.3 107816  3984 ?        Ss   09:34   0:00 /var/rudder/cfengine-community/bin/cf-execd
root      2046  0.0  0.3 107816  3984 ?        Ss   09:34   0:00 /var/rudder/cfengine-community/bin/cf-execd
# ps wwwuax|grep cf-exec|grep -v grep
root      1679  0.0  0.3 107816  3984 ?        Ss   09:34   0:00 /var/rudder/cfengine-community/bin/cf-execd
root      2046  0.0  0.3 107816  3984 ?        Ss   09:34   0:00 /var/rudder/cfengine-community/bin/cf-execd
# /etc/init.d/rudder-agent stop
rudder-agent[7161]: [INFO] Using /etc/default/rudder-agent for configuration
rudder-agent[7164]: [INFO] Using /var/rudder/cfengine-community for CFEngine workdir
rudder-agent[7165]: [INFO] Halting CFEngine Community cf-serverd...
rudder-agent[7376]: [OK] CFEngine Community cf-serverd stopped after 2 seconds
rudder-agent[7377]: [INFO] Halting CFEngine Community cf-execd...
rudder-agent[8140]: [OK] CFEngine Community cf-execd stopped after 6 seconds
# ps wwwuax|grep cf-exec|grep -v grep
root      1679  0.0  0.3 107816  3984 ?        Ss   09:34   0:00 /var/rudder/cfengine-community/bin/cf-execd
# kill 1679
# ps wwwuax|grep cf-exec|grep -v grep
# /etc/init.d/rudder-agent start
rudder-agent[8902]: [INFO] Using /etc/default/rudder-agent for configuration
rudder-agent[8905]: [INFO] Using /var/rudder/cfengine-community for CFEngine workdir
rudder-agent[8906]: [INFO] Launching CFEngine Community cf-serverd...
rudder-agent[9081]: [OK] CFEngine Community cf-serverd started after 1 seconds
rudder-agent[9082]: [INFO] Launching CFEngine Community cf-execd...
rudder-agent[9258]: [OK] CFEngine Community cf-execd started after 1 seconds
# ps wwwuax|grep cf-exec|grep -v grep
root      9255  0.0  0.2  40224  2860 ?        Ss   10:49   0:00 /var/rudder/cfengine-community/bin/cf-execd
Actions #3

Updated by Dennis Cabooter over 8 years ago

It seems like this is only happening on Ubuntu machines, not on CentOS/RHEL ones.

Actions #4

Updated by Nicolas CHARLES over 8 years ago

  • Category set to Packaging
  • Assignee changed from Nicolas CHARLES to Benoît PECCATTE
  • Target version set to 2.11.17

OK, the problem is here:

echo -n WARNING: Too many instance of CFEngine cf-execd processes running. Killing them...
WARNING: Too many instance of CFEngine cf-execd processes running. Killing them...+ xargs kill -9
+ awk 'BEGIN { OFS=" "} {print $2 }'
+ echo ' 1679 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd'

It does detect that there are 2 cf-execd processes running, but it doesn't pick the right field for the PID.

This is probably linked to #7189 and #7243

I could not reproduce it on CentOS or Debian 7, but on Ubuntu the value is invalid:

echo ${PS_COMMAND}
eval ps --no-header -e -O utsns | grep -E '^[[:space:]]*[[:digit:]]*[[:space:]]+4026531838'

But I do not have namespaces; I think we should use ps -ef.
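To illustrate the ps -ef suggestion: in a user-first layout the PID is in column 2, which is the column the script's awk already prints. The sketch below uses the two `ps wwwuax` lines posted earlier in this thread as sample input (ps -ef has different columns after the PID, but the PID is also its second field). It is not the actual patch, and dropping `-O utsns` would also drop the namespace filter, which may matter for the LXC hosts discussed in #7189.

```shell
#!/bin/sh
# Sample input: the two cf-execd lines from the `ps wwwuax` output in this thread.
sample='root      1679  0.0  0.3 107816  3984 ?        Ss   09:34   0:00 /var/rudder/cfengine-community/bin/cf-execd
root      2046  0.0  0.3 107816  3984 ?        Ss   09:34   0:00 /var/rudder/cfengine-community/bin/cf-execd'

# Count the instances, as the check script does before deciding to kill.
count=$(echo "$sample" | grep -c 'cf-execd')

# With a user-first layout, column 2 is the PID.
pids=$(echo "$sample" | awk '{print $2}')

echo "instances: $count"
echo "pids: $(echo $pids)"
```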

Actions #5

Updated by Nicolas CHARLES over 8 years ago

  • Related to Bug #7189: issues with process management on physical hosting LXC containers added
Actions #6

Updated by Benoît PECCATTE over 8 years ago

Ubuntu supports namespaces, and in the previous output the command
ps -h -o utsns --pid $$
gives 4026531838 (the value in your grep), which is only possible if you have namespace support.

But I see a possible reason: ps -O utsns changes the order of the output fields, so the second column is no longer the PID and the kill doesn't work.
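The field-order problem can be reproduced with the two lines captured in the trace above. With `ps -e -O utsns` the PID is in column 1 and the UTS namespace in column 2, so `awk '{print $2}'` hands the namespace ID to kill. The sketch below only demonstrates the difference on that sample; taking column 1 is one possible correction and is not necessarily what the pull request does.

```shell
#!/bin/sh
# Sample input: output of `ps --no-header -e -O utsns`, copied from the trace.
sample=' 1679 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd
 2046 4026531838 S ?        00:00:00 /var/rudder/cfengine-community/bin/cf-execd'

# What the script extracts today: column 2, which is the namespace ID.
wrong=$(echo "$sample" | awk '{print $2}')

# With this ps format the PID is in column 1.
right=$(echo "$sample" | awk '{print $1}')

echo "column 2 (passed to kill today): $(echo $wrong)"
echo "column 1 (the actual PIDs):      $(echo $right)"
```

This matches the trace: `xargs kill -9` received 4026531838 twice, the kill failed (hence the `+ true`), and both cf-execd processes were still listed afterwards.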

Actions #7

Updated by Benoît PECCATTE over 8 years ago

  • Status changed from New to In progress
Actions #8

Updated by Benoît PECCATTE over 8 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Benoît PECCATTE to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder-packages/pull/783
Actions #9

Updated by Benoît PECCATTE over 8 years ago

  • Status changed from Pending technical review to Pending release
  • % Done changed from 0 to 100
Actions #11

Updated by Vincent MEMBRÉ over 8 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 2.11.17, 3.0.12 and 3.1.5 which were released today.

Actions #12

Updated by Florian Heigl over 8 years ago

I found that one also needs to modify:

[root@rudder 1.0]# git show
commit 811a3ca2e8f1342b58fb19151e720c1ffda68da8
Author: root user (CLI) <root@localhost>
Date: Wed Jan 13 00:34:57 2016 +0100

adjust for lxc env

diff --git a/techniques/system/common/1.0/promises.st b/techniques/system/common/1.0/promises.st
index b59974c..5b6db6e 100644
--- a/techniques/system/common/1.0/promises.st
+++ b/techniques/system/common/1.0/promises.st
@@ -341,12 +341,12 @@ bundle agent check_cf_processes
# process_kill is the same for SIGKILL.
!windows:: # On windows, cf-execd is a service, and there can be only one instance of it running (by design)
- "process_term[execd]" string => "2";
- "process_kill[execd]" string => "5";
+ "process_term[execd]" string => "6";
+ "process_kill[execd]" string => "8";

any::
- "process_term[agent]" string => "5";
- "process_kill[agent]" string => "8";
+ "process_term[agent]" string => "8";
+ "process_kill[agent]" string => "16";
"binaries"            slist  => getindices("process_term");

This is not sufficient, since it will also raise the limits on all containers; I just don't know a more appropriate fix.
