Bug #10019: send-clean.sh blocks on inventory parsing error - Rudder - Issue Tracker

Actions

Copy link

Bug #10019

closed

send-clean.sh blocks on inventory parsing error

Added by Janos Mattyasovszky over 7 years ago. Updated about 6 years ago.

Status:

Released

Priority:

N/A

Assignee:

Benoît PECCATTE

Category:

System integration

Target version:

4.1.10

Pull Request:

https://github.com/Normation/rudder-t...

Severity:

Major - prevents use of part of Rudder | no simple workaround

UX impact:

User visibility:

Infrequent - complex configurations | third party integrations

Effort required:

Small

Priority:

Name check:

Fix check:

Regression:

Description

When there is a bug in the inventory parsing, and the a Java Exception is thrown by the endpoint, and the POST request returns a 500 error, but this is handled by send-clean.sh as if the endpoint would throw a 503 (too busy):

See: https://github.com/Normation/rudder-techniques/blob/master/tools/send-clean.sh#L88

It would make sense to only try to upload it "n" times or add somekind of queueing, so that the agent does not always try to re-upload the same (possibly) broken inventory, because that could cause the upload process of always fail on the same inventory, rendering the complete inventory-system dead...

Actions

Copy link

Updated by Vincent MEMBRÉ about 7 years ago

Category set to System integration
Status changed from New to Discussion
Assignee set to Benoît PECCATTE

send-clean.sh is stateless and do not keep track of what happened, so it may be quite complicated to change it directly.

Maybe inventory should be moved to a "retry" folder that would either: try every 30 minutes to send them / Try one more time then send to fail inventories ?

Benoit do you have more ideas ?

Actions

Copy link

Updated by Janos Mattyasovszky about 7 years ago

Well, it does have some minimal state, because it checks the inventory to be older than X minutes, to ensure the signature was also uploaded, it one was sent on it's way. You could say it is only retried if it is not older than X minutes/hours, if yes, fail it. The downside would be, that if the inventory endpoint is saturated, you'd throw away inventories, so there should be a better check than the is it 5xx or not, so you can handle a plain 500 different from a 503, since 500 could mean the inventory breaks the parsing and should not be tried more than X times, and a 503 means that it has nothing to do with the inventory, but the saturation of the endpoint.

Actions

Copy link

Updated by Benoît PECCATTE about 7 years ago

Severity set to Minor - inconvenience | misleading | easy workaround
User visibility set to Infrequent - complex configurations | third party integrations

Actions

Copy link

Updated by Jonathan CLARKE about 7 years ago

Effort required set to Small

Actions

Copy link

Updated by Benoît PECCATTE about 7 years ago

Priority set to 14

Actions

Copy link

Updated by Janos Mattyasovszky about 7 years ago

Well, it's not completely stateless, since it only sends the inventories, if (IIRC) 2mins have elapsed, so any possible delayed signatures can also be uploaded if the sender supports signing inventories.

The same idea could also be used to not try to send inventories that are older than (pi/random) 4 hours.

This could ensure that you don't saturate your sending queue of (default) 50 with broken inventories and never process anything never, and they just keep piling up.

Actions

Copy link

Updated by Jonathan CLARKE almost 7 years ago

Status changed from Discussion to New

Actions

Copy link

Updated by Jonathan CLARKE almost 7 years ago

Assignee deleted (~~Benoît PECCATTE~~)

Actions

Copy link

#10

Updated by Benoît PECCATTE almost 7 years ago

Priority changed from 14 to 24

Actions

Copy link

#11

Updated by Benoît PECCATTE over 6 years ago

Priority changed from 24 to 32

Actions

Copy link

#12

Updated by Janos Mattyasovszky over 6 years ago

Priority changed from 32 to 31

Florian found this as well: As soon you have 50 failed inventories, you get no inventory uploaded, because they get retried until the queue is saturated and send-clean exists with 1...

A valid idea would be to put all failed to a failed folder, and always process those after all the "fresh" ones, so you make sure you always try to process those which have not failed before

Actions

Copy link

#13

Updated by Florian Heigl over 6 years ago

Severity changed from Minor - inconvenience | misleading | easy workaround to Major - prevents use of part of Rudder | no simple workaround
Priority changed from 31 to 48

I found something, not exactly the same.

The inventory processing aborts once an exit 1 is propagated from send_clean.
This happens for any bad inventory that is moved to "failed"

The bad inventories are normally distributed among all inventories.
That means each inventory run can hit one.
At that point this run aborts, so it will process not 50 inventories, but some number between 1 and 50.
Without doing too much math there is a point where the remaining runs are not sufficient to process the non-broken inventories.
This, of course keeps repeating forever.

Simply put, if you inject 200 bad inventories over the name space of all inventories you will NEVER AGAIN reach a point where you'd have working inventories.

Please note to fix that the caller (policy that aborts the loop) should be fixed, not send_clean itself. Alternatively, both could be fixed to have a special return code for this case.
There are surely some cases where it's valid to just abort.

It is not valid to abort "upload inventories" because of this though, since uploading DOES work.
(unless all failed, which would be a good reason to abort and be verbosely sad)

I'm changing this to SEV: major, and would suggest to consider turning it into critical. It could even cause data loss if there's some very very bad coincidence. (Accepted nodes not getting updates, not for new nodes since they will probably not even get in)
Visibility is infrequent, can only happen on setups > 288 hosts though.

Actions

Copy link

#14

Updated by Janos Matya over 6 years ago

The problem is that the api returns 5xx on a Java issue (the ugly way heapdumping) and also on queue full.

Actions

Copy link

#15

Updated by Benoît PECCATTE about 6 years ago

Target version set to 4.1.10
Priority changed from 48 to 47

Actions

Copy link

#16

Updated by Benoît PECCATTE about 6 years ago

Status changed from New to In progress
Assignee set to Benoît PECCATTE

Actions

Copy link

#17

Updated by Benoît PECCATTE about 6 years ago

Status changed from In progress to Pending technical review
Assignee changed from Benoît PECCATTE to Alexis Mousset
Pull Request set to https://github.com/Normation/rudder-techniques/pull/1249

PR https://github.com/Normation/rudder-techniques/pull/1249

Actions

Copy link

#18