send-clean.sh blocks on inventory parsing error
When there is a bug in the inventory parsing, and the a Java Exception is thrown by the endpoint, and the POST request returns a 500 error, but this is handled by send-clean.sh as if the endpoint would throw a 503 (too busy):
It would make sense to only try to upload it "n" times or add somekind of queueing, so that the agent does not always try to re-upload the same (possibly) broken inventory, because that could cause the upload process of always fail on the same inventory, rendering the complete inventory-system dead...
#1 Updated by Vincent MEMBRÉ over 1 year ago
- Category set to System integration
- Status changed from New to Discussion
- Assignee set to Benoît PECCATTE
send-clean.sh is stateless and do not keep track of what happened, so it may be quite complicated to change it directly.
Maybe inventory should be moved to a "retry" folder that would either: try every 30 minutes to send them / Try one more time then send to fail inventories ?
Benoit do you have more ideas ?
#2 Updated by Janos Mattyasovszky over 1 year ago
Well, it does have some minimal state, because it checks the inventory to be older than X minutes, to ensure the signature was also uploaded, it one was sent on it's way. You could say it is only retried if it is not older than X minutes/hours, if yes, fail it. The downside would be, that if the inventory endpoint is saturated, you'd throw away inventories, so there should be a better check than the is it 5xx or not, so you can handle a plain 500 different from a 503, since 500 could mean the inventory breaks the parsing and should not be tried more than X times, and a 503 means that it has nothing to do with the inventory, but the saturation of the endpoint.
#7 Updated by Janos Mattyasovszky about 1 year ago
Well, it's not completely stateless, since it only sends the inventories, if (IIRC) 2mins have elapsed, so any possible delayed signatures can also be uploaded if the sender supports signing inventories.
The same idea could also be used to not try to send inventories that are older than (pi/random) 4 hours.
This could ensure that you don't saturate your sending queue of (default) 50 with broken inventories and never process anything never, and they just keep piling up.
#12 Updated by Janos Mattyasovszky 6 months ago
- Priority changed from 32 to 31
Florian found this as well: As soon you have 50 failed inventories, you get no inventory uploaded, because they get retried until the queue is saturated and send-clean exists with 1...
A valid idea would be to put all failed to a
failed folder, and always process those after all the "fresh" ones, so you make sure you always try to process those which have not failed before
#13 Updated by Florian Heigl 6 months ago
- Severity changed from Minor - inconvenience | misleading | easy workaround to Major - prevents use of part of Rudder | no simple workaround
- Priority changed from 31 to 48
I found something, not exactly the same.
The inventory processing aborts once an exit 1 is propagated from send_clean.
This happens for any bad inventory that is moved to "failed"
The bad inventories are normally distributed among all inventories.
That means each inventory run can hit one.
At that point this run aborts, so it will process not 50 inventories, but some number between 1 and 50.
Without doing too much math there is a point where the remaining runs are not sufficient to process the non-broken inventories.
This, of course keeps repeating forever.
Simply put, if you inject 200 bad inventories over the name space of all inventories you will NEVER AGAIN reach a point where you'd have working inventories.
Please note to fix that the caller (policy that aborts the loop) should be fixed, not send_clean itself. Alternatively, both could be fixed to have a special return code for this case.
There are surely some cases where it's valid to just abort.
It is not valid to abort "upload inventories" because of this though, since uploading DOES work.
(unless all failed, which would be a good reason to abort and be verbosely sad)
I'm changing this to SEV: major, and would suggest to consider turning it into critical. It could even cause data loss if there's some very very bad coincidence. (Accepted nodes not getting updates, not for new nodes since they will probably not even get in)
Visibility is infrequent, can only happen on setups > 288 hosts though.
#17 Updated by Benoît PECCATTE 4 months ago
- Status changed from In progress to Pending technical review
- Assignee changed from Benoît PECCATTE to Alexis MOUSSET
- Pull Request set to https://github.com/Normation/rudder-techniques/pull/1249
#19 Updated by Benoît PECCATTE 4 months ago
- Status changed from Pending technical review to Pending release
Applied in changeset rudder-techniques|508a8a66244a0fac4c54e018509f8be9e39d1b67.