[Return result is incorrect]Limit is incorrect in Join statement in AWS Redshift(RedShift)-CodePudding

The querying resultset is incorrect when running join statement which has limit operater.

One of subquery:

SELECT A3.customerid FROM b1traderecords A3 WHERE A3.customerid  < 100 limit 5

customerid 
-----------
         71
         88
         11
         99
         44

  5 record(s) selected.

The join statement which includes the subquery:


select A2.customerid from (SELECT A3.customerid FROM b1traderecords A3 WHERE A3.customerid  < 100 limit 5) A0, (select customerid from b3customerinfo where customerrating > 0.7) A2 where A0.customerid = A2.customerid

customerid 
-----------
         88
         44
         88
          9
         90

  5 record(s) selected.

The return value "9" is not in first subquery result So, the join resultset seems incorrect.

Is it a bug? Any suggestions? Thanks

CodePudding user response：

This is expected behavior. Redshift is a cluster where the different compute nodes (and slices) operate independently. Different rows of data are distributed around the "slices" of the cluster so each has different data. When you put a small limit like this on the query whichever slice's data arrives first will pass through the limit and the rest will be lost. So there is a "race" between the slices to see which one returns data first. Who "wins" can change for all sorts of reasons.

To achieve predicable results from a LIMIT'ed query you need an ORDER BY clause.

################### Update

With the order clause in the subquery with the LIMIT (or TOP) the results become predictable but there is still something amiss. I've recreated your test case and tried these 3 versions of your query.

select A2.i1, a1.i1 
from (select top 2 i1 from ffnr_i1 order by i1) A1 
, (select i1 from ffnr_i2 ) A2 
where A1.i1 = A2.i1 and A2.i1 > 20;

select A2.i1, a1.i1 
from (select top 2 i1 from ffnr_i1 order by i1) A1 
, (select i1 from ffnr_i2 ) A2 
where A1.i1 = A2.i1;

select A2.i1, a1.i1 
from (select top 2 i1 from ffnr_i1 order by i1) A1 
, (select i1 from ffnr_i2 ) A2 
where A2.i1 > 20;

As well as run EXPLAIN on the queries. The first one is just moving the subquery where clause to the top level - functionally equivalent. The explain plan shows the i1 > 20 being applied to the first subquery through the a1.i1 = a2.i1 requirement. It is being applied after the limit so shouldn't cause a problem.

The second example produces the expected result - no matches. The only change in the explain plan is that there is no longer a > 20 where clause on either of the tables. No surprise there.

The third example also produces what you would expect - the cross join of A2 with the first 2 values of A1. The only difference in the explain plans is the nested loop join this requires and that A1 is now evaluated first.

So as near as tell it is taking the LIMIT excluding the matching value, the WHERE cause on the joining column, and the implied WHERE clause on the other table. This does seem like the query optimizer is dropping something in this case. You should submit this to AWS as a potential issue / bug.

I rewrote the query with modern join syntax to see if that would help out the optimizer but no luck. I also added more values (25, 26, 27 35, 36, 37) to A1 on the off chance that the WHERE condition was being applied before the LIMIT but this didn't change the issue either.

Unless someone sees something I'm missing I think this is likely a Redshift optimizer bug.

CodePudding user response：

Thanks Bill. When I add "order by before "limit", it seems incorrect either. I think the "order by" can avoid the "race".

select A2.customerid from (SELECT A3.customerid FROM b1traderecords A3 WHERE A3.customerid  < 100 order by 1 limit 5) A0, (select customerid from b3customerinfo where customerrating > 0.7) A2 where A0.customerid = A2.customerid

customerid 
-----------
          7
          9
          9
         26
         26

  5 record(s) selected.

 SELECT A3.customerid FROM b1traderecords A3 WHERE A3.customerid  < 100 order by 1 limit 5

customerid 
-----------
          1
          1
          2
          2
          2

  5 record(s) selected.

Is it correct?

I test some other cases too(update the predicate only in each sub-query), and the issue can be reproduced in some cases, but not all

CodePudding user response：

I simplify the issue for your reproduce:

a> Create Tables ffnr_i1 and ffnr_i2 in RedShift

create table ffnr_i1(i1 int);
insert into ffnr_i1 values(11),(12),(13),(33);

create table ffnr_i2(i1 int);
insert into ffnr_i2 values(21),(22),(23),(33);

b> Query and and reproduce the issue

select * from ffnr_i1

i1         
-----------
         11
         12
         13
         33

  4 record(s) selected.

select * from ffnr_i2

i1         
-----------
         21
         22
         23
         33

  4 record(s) selected.

select i1 from ffnr_i1 order by 1 limit 2

i1         
-----------
         11
         12

  2 record(s) selected.

select A2.i1 from (select i1 from ffnr_i1 order by 1 limit 2) A1, (select i1 from ffnr_i2 where i1 > 20) A2 where A1.i1 = A2.i1

i1         
-----------
         33

  1 record(s) selected.

We can see that the last result set has value "33" which is not included in its subquery(A1) result(11,22)

when I fetch values of "A1.i1", the result is incorrect either.

select A2.i1, A1.i1 from (select i1 from ffnr_i1 order by 1 limit 2) A1, (select i1 from ffnr_i2 where i1 > 20) A2 where A1.i1 = A2.i1

i1          i1         
----------- -----------
         33          33

  1 record(s) selected.

select A1.i1 from (select i1 from ffnr_i1 order by 1 limit 2) A1, (select i1 from ffnr_i2 where i1 > 20) A2 where A1.i1 = A2.i1

i1         
-----------
         33

  1 record(s) selected.

The resultset of join statement expects 0 row

Is it a bug?

Thanks