I want to select only the records from table Stock based on the column PostingDate. The PostingDate should be after the InitDate in another table called InitClient. However, there are currently 2 clients in both tables (client 1 and client 2), that both have a different InitDate.

With the code below I get exactly what I need currently, based on the sample data also included underneath. However, two problems arise, first of all based on millions of records the query is taking way too long (hours). And second of all, it isn't dynamic at all, every time when a new client is included.

A potential option to cover the performance issue would be to write two separate query's, one for Client 1 and one for Client 2 with a UNION in between. Unfortunately, this then isn't dynamic enough since multiple clients are possible.

    SELECT      
        Material
        ,Stock
        ,Stock.PostingDate
        ,Stock.Client
    FROM Stock
    LEFT JOIN (SELECT InitDate FROM InitClient where Client = 1) C1 ON 1=1
    LEFT JOIN (SELECT InitDate FROM InitClient where Client = 2) C2 ON 1=1  
    WHERE   
        (
            (Stock.Client = 1 AND Stock.PostingDate > C1.InitDate) OR
            (Stock.Client = 2 AND Stock.PostingDate > C2.InitDate)
        )

Sample dataset:

CREATE TABLE InitClient
(
    Client  varchar(300),
    InitDate    date
);
INSERT INTO InitClient (Client,InitDate)
VALUES
    ('1', '5/1/2021'),
    ('2', '1/31/2021');
SELECT * FROM InitClient

CREATE TABLE Stock 
(
    Material    varchar(300),
    PostingDate varchar(300),
    Stock   varchar(300),
    Client  varchar(300)
);
INSERT INTO Stock (Material,PostingDate,Stock,Client)
VALUES
    ('322', '1/1/2021', '5', '1'),
    ('101', '2/1/2021', '5', '2'),
    ('322', '3/2/2021', '10', '1'),
    ('101', '4/13/2021', '5', '1'),
    ('400', '5/11/2021', '170', '2'),
    ('401', '6/20/2021', '200', '1'),
    ('322', '7/20/2021', '160', '2'),
    ('400', '8/9/2021', '93', '2');
SELECT * FROM Stock

Desired result, but then with a substitute for the OR statement to ramp up the performance:

| Material | PostingDate | Stock | Client |
|----------|-------------|-------|--------|
| 322      | 1/1/2021    | 5     | 1      |
| 101      | 2/1/2021    | 5     | 2      |
| 322      | 3/2/2021    | 10    | 1      |
| 101      | 4/13/2021   | 5     | 1      |
| 400      | 5/11/2021   | 170   | 2      |
| 401      | 6/20/2021   | 200   | 1      |
| 322      | 7/20/2021   | 160   | 2      |
| 400      | 8/9/2021    | 93    | 2      |

Any suggestions if there is an substitute possible in the above code to keep performance, while making it dynamic?

CodePudding user response：

There is nothing wrong in expecting your query to be dynamic. However, in order to make it more performant, you may need to reach a compromise between to conflicting expectations. I will present here a few ways to optimize your query, some of them involves some drastic changes, but eventually it is you or your client who decides how this needs to be improved. Also, some of the improvements might be ineffective, so do not take anything for granted, test everything. Without further ado, let's see the suggestions

The query

First I would try to change the query a little, maybe something like this could help you

SELECT      
    Material
    ,Stock
    ,Stock.PostingDate
    ,C1.InitDate
    ,C2.InitDate
    ,Stock.Client
FROM Stock
LEFT JOIN InitClient C1 ON Client = 1
LEFT JOIN InitClient C2 ON Client = 2
WHERE   
    (
        (Stock.Client = 1 AND Stock.PostingDate > C1.InitDate) OR
        (Stock.Client = 2 AND Stock.PostingDate > C2.InitDate)
    )

Sometimes a simple step of getting rid of subselects does the trick

The indexes

You may want to speed up your process by creating indexes, for example on Stock.PostingDate.

Helper table

You can create a helper table where you store the Stock records' relevant data, so you perform the slow query ONCE in a while, maybe once in a week, or each time a new client enters the stage and store the results in the helper table. Once the prerequisite calculation is done, you will be able to query only the helper table with its few records, reaching lightning fast behavior. So, the idea is to execute the slow query rarely, cache/store the results and reuse them instead of calculating it every time.

A new column

You could create a column in your Stock table named InitDate and fill that with data for each record periodically. It will take a long while at the first execution, but then you will be able to query only the Stock table without joins and subselects.

CodePudding user response：

You can optimize this query quite a bit.

Firstly, those two LEFT JOINs are basically just semi-joins, because you don't actually return any results from them. So we can turn them into a single EXISTS.
You will also get an implicit conversion to int, because Client is varchar and 1,2 is an int. So change that to '1','2', or you could change the column type.
PostingDate is also varchar, that should really be date

SELECT      
    s.Material
    ,s.Stock
    ,s.PostingDate
    ,s.Client
FROM Stock s
WHERE s.Client IN ('1','2')
  AND EXISTS (SELECT 1
    FROM InitClient c
    WHERE s.PostingDate > c.InitDate
      AND c.Client = s.Client
);

Next you want to look at indexing. For this query (not accounting for any other queries being run), you probably want the following indexes (remove the INCLUDE for a clustered index)

InitClient (Client, InitDate)

Stock (Client) INCLUDE (PostingDate, Material, Stock)

It is possible that even with these indexes that you may get a scan on Stock, because IN functions like an OR. This does not always happen, it's worth checking. If so, instead you can rewrite this to use UNION ALL

SELECT      
    s.Material
    ,s.Stock
    ,s.PostingDate
    ,s.Client
FROM (
    SELECT *
    FROM Stock s
    WHERE s.Client = '1'
    UNION ALL
    SELECT *
    FROM Stock s
    WHERE s.Client = '2'

) s
WHERE EXISTS (SELECT 1
    FROM InitClient c
    WHERE s.PostingDate > c.InitDate
      AND c.Client = s.Client
);

db<>fiddle