Greenplum COUNT(DISTINCT) produces different results for identical iterations of a query on a static-CodePudding

I am using a Greenplum db (v 6.22.1) running PSQL (v 9.4.26) on a small linux server. I am analyzing ~36 million rows of data distributed among 6 segments. I want to count the unique number of users per month, with a fairly straightforward query:

SELECT month, COUNT(DISTINCT(user_id)) AS no_users
FROM users_table
GROUP BY month
ORDER BY month

As expected, I get output that looks like this:

month	no_users
2022-08-01	490005
2022-09-01	510015

However, when I rerun the query above (with no changes to data), I get the following result:

month	no_users
2022-08-01	490020
2022-09-01	510001

This happens every time the query is rerun, with no obvious pattern between the results each time. I have rebuilt the table as well as reuploaded the data to no avail. Does anyone have any idea as to what might be causing this problem? I have resorted to googling "Greenplum broken" but as anticipated that has not got me far.

CodePudding user response：

Did you build GPDB yourself? I would ask this in the GPDB users email list or slack channel. Developers might have a better clue.

CodePudding user response：

Try this:

VACUUM (VERBOSE, ANALYZE) users_table;

Check data distribution over segments:

SELECT count(*) c, gp_segment_id FROM users_table GROUP BY 2;

Double-check that the query result equals your COUNT(*) :

select sum(c) from (SELECT count(*) c, gp_segment_id FROM users_table GROUP BY 2) x;

should equal :

select count(*) from users_table;