Node's FQDN-Resolution is sometimes invalid
- Do we use the same DNS Server as the policy server?
- Do we use the same IP to reach the policy server as the default gateway's IP's FQDN? (issue present on multi-homed servers)
- Does the external DNS also know me by the name I think I have? (SUSE's openstack assigns some bogus hostnames like external.server.fqdn)
I would suggest enabling a "hook" binary: if it exists, it is executed to resolve the FQDN of the node it runs on, since the sysadmin might know better how the nodes are named. Or it should at least be customizable via a local variable, but the binary approach would actually allow you to use custom logic to reliably resolve what the node is called, even if it is renamed.
#3 Updated by François ARMAND over 2 years ago
This is actually the best idea we had on the subject, thanks Janos !
And it could be rather easy, because it "just" needs a hook in FusionInventory/Agent/Task/Inventory/Generic.pm to call a script (path defined by convention) if present, and use its result on success. It would also help to support special hardware: call a script that only handles the cases it knows how to handle.
There is a question remaining about the distribution of that script. Of course, it could be defined by Rudder once the node is accepted, but the use case is mostly to make a valid inventory, so to make things correct on the first inventory.
#4 Updated by Janos Mattyasovszky over 2 years ago
It's a chicken-and-egg problem: this will be something that the sysadmin has to install during the Rudder agent's install. This is why I have also suggested a separate variable (maybe the content of a new Rudder-specific config file, which would then be overruled by a hook binary if it exists).
So the suggestion I had was:
- if a hook binary exists, execute it, if it succeeded, take the output as the fqdn for the inventory.
- if a rudder specific configuration file exists, take the first line as the fqdn
- last: fall back to the method used up until now.
This would provide a nice way to influence the behavior by either setting it hard to a specific value (like for initial rollout/installation), then later remove the file, and place an executable there by a rudder rule.
#6 Updated by François ARMAND over 2 years ago
We were reaching the same conclusion. In fact, this is somehow linked with the bigger picture of letting the user add / change arbitrary information on inventory if he wants so, see #4670. But the hostname is perhaps a bit peculiar, since it is also used (for now) for node identification and authentication for promise distribution.
So, your 3 steps process seems good.
One question, though. Is it more or less convenient, or just another (fourth) option, to look for an environment variable (e.g. RUDDER_FQDN)?
#8 Updated by Florian Heigl over 2 years ago
imo most of these cases are going back to
they try to have it fixed in libnet
- libnet has a lot of open bugs to that end and is not fully maintained
- libnet is doing the right thing in its own sense - the dns lookup doesn't work and that's what its return is based on
- fusion does use that mangled non-return and PROCESSES it, ending up with the empty hostname
- fusion should not(!) use external dns to strip-check the OS hostname
- it can do that in addition, but the primary source for a system's hostname is the system's hostname!
- round one: look at hostname
- round two: then look at hostname --fqdn
- round two: then look at the remote-looked up fqdn minus domain (the more fragile bit)
accept the longest result that has any overlap with one from round two.
If there is no overlap, fall back to only using the hostname.
That would also be the right place for the trigger, to subvert THIS behaviour.
But there has to be a defined default behavior to come up with a filled entry for <HOSTNAME>
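One possible reading of the rounds above, as a rough Python sketch. It interprets "overlap" as "the candidate's first label equals the first label of the OS hostname", which is an assumption and not something the comment spells out. The function name `pick_hostname` and its parameters are hypothetical; the candidates would come from `hostname --fqdn` and the remote lookup.

```python
from typing import Iterable, Optional


def pick_hostname(os_hostname: str, candidates: Iterable[Optional[str]]) -> str:
    """Pick the longest name that agrees with the OS hostname.

    Assumption: a candidate "overlaps" when its first DNS label matches
    the first label of the OS hostname (case-insensitive).
    """
    base = os_hostname.strip()
    first_label = base.lower().split(".")[0]
    results = [base]  # the OS hostname is always an acceptable result
    for candidate in candidates:
        if not candidate or not candidate.strip():
            continue  # empty lookups (e.g. a failed DNS query) are ignored
        name = candidate.strip()
        if first_label and name.lower().split(".")[0] == first_label:
            results.append(name)
    # With no overlapping candidate, this is just the OS hostname.
    return max(results, key=len)
```

With this reading, a bogus externally assigned name such as external.server.fqdn never replaces the local hostname, and the result is never empty as long as the OS hostname is set.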
#9 Updated by François ARMAND over 2 years ago
Some more information from discussion on #irc channel:
- in all cases, we must have a post-check validation on the agent before sending the inventory, to at least ensure that a valid hostname/fqdn entry is filled. Without that check, the inventory will be correctly sent, so the node will NOT issue a new inventory before 24h, BUT we are sure that the inventory WON'T be accepted by the server. On the other hand, if the node sees an error when sending the inventory, it will retry on the next run, and on the one after, until the post-check is valid. That allows waiting for a DNS config to propagate, for example.
- there are nonetheless other bugs in fusion, which could be better addressed than they are now (see previous comment).
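A minimal Python sketch of such a post-check, to illustrate the idea only. The real agent is written in Perl, and the names `is_valid_hostname` and `check_before_send` are hypothetical. The validity rule used here (non-empty, DNS-style labels of 1 to 63 characters, at most 253 characters overall) is an assumption about what "valid" should mean.

```python
import re
from typing import Optional

# One DNS label: letters, digits and hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name: Optional[str]) -> bool:
    """Return True if name is a non-empty, syntactically valid hostname/FQDN."""
    if name is None:
        return False
    name = name.strip()
    if name.endswith("."):
        name = name[:-1]  # tolerate a single trailing dot
    if not name or len(name) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in name.split("."))


def check_before_send(hostname: Optional[str]) -> None:
    """Raise instead of sending, so the agent retries on its next run."""
    if not is_valid_hostname(hostname):
        raise ValueError(f"refusing to send inventory: invalid hostname {hostname!r}")
```

Raising an error here stands in for "the sending failed", which is what lets the node try again on the next run instead of waiting 24h.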
#20 Updated by François ARMAND about 2 years ago
So, to recap:
We want to let the user specify a value for RUDDER/HOSTNAME (which is actually a FQDN...). The value is looked up in the following order, stopping at the first success:
- in the RUDDER_FQDN environment variable, if it exists and is not empty
- if the command TODO_SPECIFY_PATH_CONVENTION_FOR_FQDN_COMMAND exists, is executable, returns correctly (no error code), and returns a non-empty string
- if the file TODO_SPECIFY_PATH_CONVENTION_FOR_FQDN_CONFIG_FILE exists, is readable and is non-empty, take the first valid line (question: do we want to allow a command? Key/value? Or just a file with only the FQDN on the first line?)
- else, fall back to current method.
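The lookup order above could be sketched as follows. This is an illustration in Python, not the agent's actual code. The hook and config file paths are left as parameters because their convention is still a TODO, the 10 second timeout is an arbitrary assumption, and the config file is treated as "plain FQDN on the first non-blank line", which is only one of the options raised in the question above.

```python
import os
import subprocess
from typing import Callable, Mapping, Optional


def first_nonempty_line(text: str) -> Optional[str]:
    """Return the first non-blank line of text, stripped, or None."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return None


def fqdn_from_env(env: Mapping[str, str]) -> Optional[str]:
    # Step 1: RUDDER_FQDN, if set and not empty.
    return env.get("RUDDER_FQDN", "").strip() or None


def fqdn_from_hook(path: str) -> Optional[str]:
    # Step 2: the hook must exist, be executable, exit with 0 and print something.
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        return None
    try:
        result = subprocess.run([path], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired, UnicodeDecodeError):
        return None
    if result.returncode != 0:
        return None
    return first_nonempty_line(result.stdout)


def fqdn_from_file(path: str) -> Optional[str]:
    # Step 3: assumed format is a plain FQDN on the first non-blank line.
    try:
        with open(path, encoding="utf-8") as handle:
            return first_nonempty_line(handle.read())
    except (OSError, UnicodeDecodeError):
        return None


def resolve_fqdn(env: Mapping[str, str], hook_path: str, config_path: str,
                 fallback: Callable[[], str]) -> str:
    """Try each source in order and stop at the first one that yields a value."""
    return (fqdn_from_env(env)
            or fqdn_from_hook(hook_path)
            or fqdn_from_file(config_path)
            or fallback())
```

A caller would pass `os.environ` as `env` and wrap the current detection method in `fallback`, so that it only runs when none of the overrides applies.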
#22 Updated by Janos Mattyasovszky about 2 years ago
Regarding the questions:
I think the config file should be somewhere in /etc, like /etc/rudder.conf, and it would make sense to make it extensible, so using key=value where key could be the same as the environment variable, like:
#39 Updated by François ARMAND about 1 year ago
- Severity changed from Major - prevents use of part of Rudder | no simple workaround to Critical - prevents main use of Rudder | no workaround | data loss | security
- User visibility changed from Getting started - demo | first install | level 1 Techniques to Operational - other Techniques | Technique editor | Rudder settings
- Effort required set to Large
- Priority set to 38
In 4.1, we are using agent key to identify the nodes, so the problem does not occur anymore.
We still don't have a good solution for 3.1.