Project

General

Profile

Actions

Bug #10019

closed

send-clean.sh blocks on inventory parsing error

Added by Janos Mattyasovszky over 7 years ago. Updated about 6 years ago.

Status:
Released
Priority:
N/A
Category:
System integration
Target version:
Severity:
Major - prevents use of part of Rudder | no simple workaround
UX impact:
User visibility:
Infrequent - complex configurations | third party integrations
Effort required:
Small
Priority:
47
Name check:
Fix check:
Regression:

Description

When there is a bug in the inventory parsing, and the a Java Exception is thrown by the endpoint, and the POST request returns a 500 error, but this is handled by send-clean.sh as if the endpoint would throw a 503 (too busy):

See: https://github.com/Normation/rudder-techniques/blob/master/tools/send-clean.sh#L88

It would make sense to only try to upload it "n" times or add somekind of queueing, so that the agent does not always try to re-upload the same (possibly) broken inventory, because that could cause the upload process of always fail on the same inventory, rendering the complete inventory-system dead...

Actions #1

Updated by Vincent MEMBRÉ about 7 years ago

  • Category set to System integration
  • Status changed from New to Discussion
  • Assignee set to Benoît PECCATTE

send-clean.sh is stateless and do not keep track of what happened, so it may be quite complicated to change it directly.

Maybe inventory should be moved to a "retry" folder that would either: try every 30 minutes to send them / Try one more time then send to fail inventories ?

Benoit do you have more ideas ?

Actions #2

Updated by Janos Mattyasovszky about 7 years ago

Well, it does have some minimal state, because it checks the inventory to be older than X minutes, to ensure the signature was also uploaded, it one was sent on it's way. You could say it is only retried if it is not older than X minutes/hours, if yes, fail it. The downside would be, that if the inventory endpoint is saturated, you'd throw away inventories, so there should be a better check than the is it 5xx or not, so you can handle a plain 500 different from a 503, since 500 could mean the inventory breaks the parsing and should not be tried more than X times, and a 503 means that it has nothing to do with the inventory, but the saturation of the endpoint.

Actions #3

Updated by Benoît PECCATTE about 7 years ago

  • Severity set to Minor - inconvenience | misleading | easy workaround
  • User visibility set to Infrequent - complex configurations | third party integrations
Actions #5

Updated by Jonathan CLARKE about 7 years ago

  • Effort required set to Small
Actions #6

Updated by Benoît PECCATTE about 7 years ago

  • Priority set to 14
Actions #7

Updated by Janos Mattyasovszky about 7 years ago

Well, it's not completely stateless, since it only sends the inventories, if (IIRC) 2mins have elapsed, so any possible delayed signatures can also be uploaded if the sender supports signing inventories.

The same idea could also be used to not try to send inventories that are older than (pi/random) 4 hours.

This could ensure that you don't saturate your sending queue of (default) 50 with broken inventories and never process anything never, and they just keep piling up.

Actions #8

Updated by Jonathan CLARKE almost 7 years ago

  • Status changed from Discussion to New
Actions #9

Updated by Jonathan CLARKE almost 7 years ago

  • Assignee deleted (Benoît PECCATTE)
Actions #10

Updated by Benoît PECCATTE almost 7 years ago

  • Priority changed from 14 to 24
Actions #11

Updated by Benoît PECCATTE over 6 years ago

  • Priority changed from 24 to 32
Actions #12

Updated by Janos Mattyasovszky over 6 years ago

  • Priority changed from 32 to 31

Florian found this as well: As soon you have 50 failed inventories, you get no inventory uploaded, because they get retried until the queue is saturated and send-clean exists with 1...

A valid idea would be to put all failed to a failed folder, and always process those after all the "fresh" ones, so you make sure you always try to process those which have not failed before

Actions #13

Updated by Florian Heigl over 6 years ago

  • Severity changed from Minor - inconvenience | misleading | easy workaround to Major - prevents use of part of Rudder | no simple workaround
  • Priority changed from 31 to 48

I found something, not exactly the same.

The inventory processing aborts once an exit 1 is propagated from send_clean.
This happens for any bad inventory that is moved to "failed"

The bad inventories are normally distributed among all inventories.
That means each inventory run can hit one.
At that point this run aborts, so it will process not 50 inventories, but some number between 1 and 50.
Without doing too much math there is a point where the remaining runs are not sufficient to process the non-broken inventories.
This, of course keeps repeating forever.

Simply put, if you inject 200 bad inventories over the name space of all inventories you will NEVER AGAIN reach a point where you'd have working inventories.

Please note to fix that the caller (policy that aborts the loop) should be fixed, not send_clean itself. Alternatively, both could be fixed to have a special return code for this case.
There are surely some cases where it's valid to just abort.

It is not valid to abort "upload inventories" because of this though, since uploading DOES work.
(unless all failed, which would be a good reason to abort and be verbosely sad)

I'm changing this to SEV: major, and would suggest to consider turning it into critical. It could even cause data loss if there's some very very bad coincidence. (Accepted nodes not getting updates, not for new nodes since they will probably not even get in)
Visibility is infrequent, can only happen on setups > 288 hosts though.

Actions #14

Updated by Janos Matya over 6 years ago

The problem is that the api returns 5xx on a Java issue (the ugly way heapdumping) and also on queue full.

Actions #15

Updated by Benoît PECCATTE about 6 years ago

  • Target version set to 4.1.10
  • Priority changed from 48 to 47
Actions #16

Updated by Benoît PECCATTE about 6 years ago

  • Status changed from New to In progress
  • Assignee set to Benoît PECCATTE
Actions #17

Updated by Benoît PECCATTE about 6 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Benoît PECCATTE to Alexis Mousset
  • Pull Request set to https://github.com/Normation/rudder-techniques/pull/1249
Actions #18

Updated by Rudder Quality Assistant about 6 years ago

  • Assignee changed from Alexis Mousset to Benoît PECCATTE
Actions #19

Updated by Benoît PECCATTE about 6 years ago

  • Status changed from Pending technical review to Pending release
Actions #20

Updated by Vincent MEMBRÉ about 6 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 4.1.10 and 4.2.4 which were released today.

Actions

Also available in: Atom PDF