I am using a Greenplum db (v 6.22.1) running PSQL (v 9.4.26) on a small linux server. I am analyzing ~36 million rows of data distributed among 6 segments. I want to count the unique number of users per month, with a fairly straightforward query:
SELECT month, COUNT(DISTINCT(user_id)) AS no_users
FROM users_table
GROUP BY month
ORDER BY month
As expected, I get output that looks like this:
month | no_users |
---|---|
2022-08-01 | 490005 |
2022-09-01 | 510015 |
However, when I rerun the query above (with no changes to data), I get the following result:
month | no_users |
---|---|
2022-08-01 | 490020 |
2022-09-01 | 510001 |
This happens every time the query is rerun, with no obvious pattern between the results each time. I have rebuilt the table as well as reuploaded the data to no avail. Does anyone have any idea as to what might be causing this problem? I have resorted to googling "Greenplum broken" but as anticipated that has not got me far.
CodePudding user response:
Did you build GPDB yourself? I would ask this in the GPDB users email list or slack channel. Developers might have a better clue.
CodePudding user response:
Try this:
VACUUM (VERBOSE, ANALYZE) users_table;
Check data distribution over segments:
SELECT count(*) c, gp_segment_id FROM users_table GROUP BY 2;
Double-check that the query result equals your COUNT(*) :
select sum(c) from (SELECT count(*) c, gp_segment_id FROM users_table GROUP BY 2) x;
should equal :
select count(*) from users_table;