Project

General

Profile

Actions

Bug #3902

closed

Postgres performance problem when loading nodeManager > nodes

Added by Andrew Cranson over 10 years ago. Updated over 10 years ago.

Status:
Rejected
Priority:
N/A
Assignee:
-
Category:
Web - Nodes & inventories
Target version:
-
Severity:
UX impact:
User visibility:
Effort required:
Priority:
Name check:
Fix check:
Regression:

Description

I'm experiencing some problems when loading https://RudderURL/rudder/secure/nodeManager/nodes due to what appears to be database performance issues.

There are some strange problems where if RudderSysEvents table has a result the query is returned in <1ms, but if there is NO result it's taking ~2500ms!

Example where a node is returned:

rudder=> EXPLAIN ANALYZE VERBOSE SELECT executiondate, nodeid, ruleid, directiveid, serial, component, keyValue, executionTimeStamp, eventtype, policy, msg FROM RudderSysEvents WHERE 1 = 1 AND nodeid = '71059a73-5599-4050-ac1c-a53e948b01e5' ORDER BY executionTimeStamp DESC LIMIT 1;
                                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..8.59 rows=1 width=180) (actual time=0.726..0.726 rows=1 loops=1)
   Output: executiondate, nodeid, ruleid, directiveid, serial, component, keyvalue, executiontimestamp, eventtype, policy, msg
   ->  Index Scan Backward using executiontimestamp_idx on ruddersysevents  (cost=0.00..494356.36 rows=57554 width=180) (actual time=0.724..0.724 rows=1 loops=1)
         Output: executiondate, nodeid, ruleid, directiveid, serial, component, keyvalue, executiontimestamp, eventtype, policy, msg
         Filter: (nodeid = '71059a73-5599-4050-ac1c-a53e948b01e5'::text)
 Total runtime: 0.774 ms
(6 rows)

This returns 1 result and takes just around 0.5-0.8ms to load each time.

rudder=> EXPLAIN ANALYZE VERBOSE SELECT executiondate, nodeid, ruleid, directiveid, serial, component, keyValue, executionTimeStamp, eventtype, policy, msg FROM RudderSysEvents WHERE 1 = 1 AND nodeid = '96098873-b38b-4342-bc6a-80bf1d03db1b' ORDER BY executionTimeStamp DESC LIMIT 1;
                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..30.15 rows=1 width=180) (actual time=2276.480..2276.480 rows=0 loops=1)
   Output: executiondate, nodeid, ruleid, directiveid, serial, component, keyvalue, executiontimestamp, eventtype, policy, msg
   ->  Index Scan Backward using executiontimestamp_idx on ruddersysevents  (cost=0.00..494356.36 rows=16398 width=180) (actual time=2276.478..2276.478 rows=0 loops=1)
         Output: executiondate, nodeid, ruleid, directiveid, serial, component, keyvalue, executiontimestamp, eventtype, policy, msg
         Filter: (nodeid = '96098873-b38b-4342-bc6a-80bf1d03db1b'::text)
 Total runtime: 2276.531 ms
(6 rows)

This returns zero results, and takes a long time to load (up to 2500ms).

Indexes are enabled and the server has a lot of RAM (around 8GB for Postgres, plenty more available).

We have almost 400 nodes registered in Rudder 2.7.1 (most agents are still 2.7.0) so far and run this on CentOS 6 x86_64 with Postgres 8.4.13 (CentOS version).

The impact of this is a very slow loading node list, but I'm not sure of two things:

#Is this strange behaviour (it seems to me like it is) and do you have any idea why it occurs? Maybe it's a Postgres bug, or even normal behaviour?
#Why is our Rudder policy server running these queries on nodes which don't exist in RudderSysEvents table? This is a Rudder bug, intended behaviour?

Thanks very much for your help.


Related issues 1 (0 open1 closed)

Is duplicate of Rudder - Bug #3674: Missing index on DB for "reports by nodes" leads to timeout for node list pageReleasedFrançois ARMANDActions
Actions #1

Updated by Nicolas CHARLES over 10 years ago

Thank you for you really clear bug report !
This is a problem that was identified on ticket http://www.rudder-project.org/redmine/issues/3674

Basically, the indexes are on execution date, not on couple node,execution time
So of a node doesn't have reports in a common time area (often, latest), then postgres resolves on scanning all the database

The fixe hasn't been included on the release yet, has we try to figure if it would have bad side effects

Actions #2

Updated by Nicolas CHARLES over 10 years ago

For you second question, it's an intended behavior: we want to know for every nodes when we did receive last messages, so we have to look at the last reports, and so far (it will change in 2.8 probably) the only way is to look at the database

Actions #3

Updated by Andrew Cranson over 10 years ago

Thanks a lot for the quick reply!

Sorry I didn't realise Bug #3674 was the same, I thought it was unrelated from the summary but indeed having just read it in full I agree this is the same issue. We're also having problems with "the server cannot be contacted at this time" but I didn't get a chance to debug that yet so I'm pleased (I think ;) that they're related. Please feel free to reject this bug and/or make it a duplicate and I'll continue discussion in #3674. :)

Actions #4

Updated by François ARMAND over 10 years ago

  • Status changed from New to Rejected

It's in fact a duplicate of #3674

Actions

Also available in: Atom PDF