I have a (slow) query that almost does what I want. The bold section is the pseudo-code I am trying to add. The goal of this query is to get a single row per custnumber with 5 columns of data:
- colA: MKT
- colB: custnumber
- colC: most recent apptdate WHERE src_id = Loss Recovery AND dsp_id = Sale
- colD: most recent apptdate WHERE src_id NOT= Loss Recovery AND dsp_id NOT= Sale and this apptdate is earlier than LRSale
- colE: days between NoSale and LRSale
This is what I have built so far, close to working. I have tried several different ways to get the last bit in but am having real trouble parsing the syntax to make it work right.
SELECT DISTINCT *, DATEDIFF(day,NoSale,LRSold) AS 'Diff'
FROM(SELECT a.MKT,
a.custnumber,
(SELECT TOP 1 CAST(apptdate AS DATE)
FROM prospectissues b
WHERE b.custnumber = a.custnumber
**AND b.apptdate < c.apptdate**
AND b.src_id <> 'Loss Recovery' AND b.dsp_id <> 'Sale'
ORDER BY apptdate DESC) AS 'NoSale',
(SELECT TOP 1 CAST(apptdate AS DATE)
FROM prospectissues c
WHERE c.custnumber = a.custnumber
AND c.src_id = 'Loss Recovery' AND c.dsp_id = 'Sale'
ORDER BY apptdate DESC) AS 'LRSold'
FROM prospectissues a
WHERE CAST(apptdate AS DATE) >= DATEADD(yy, DATEDIFF(yy,0,GETDATE())-3,0)) z
WHERE LRSold IS NOT NULL
Thank you, I appreciate any insight! I'm sure it's something simple that I'm just too tired to see right now. Been banging my head against this for too long.
-edit 7/25
I cannot write to the database, only run Select queries. Here is a sample of the table I'm working with (rows numbered just for easy reference):
| MKT | custnumber | dsp_id | src_id | ApptDate |
----------------------------------------------------------
1 | CLE | 14675 | PMPrice | Email | 3/7/20 |
2 | CLE | 14675 | Sale | Radio | 3/9/20 |
3 | CLE | 246466 | NH | Prior Customer | 3/7/20 |
4 | PIT | 284747 | PMPrice | Show | 3/7/20 |
5 | PIT | 284747 | Sale | Loss Recovery | 3/12/20 |
6 | PIT | 384603 | Sale | SelfGen | 3/3/20 |
7 | PIT | 384603 | Sale | Loss Recovery | 3/8/20 |
8 | COL | 384764 | PMPrice | Website | 3/15/22 |
9 | COL | 384921 | Sale | Website | 3/22/20 |
10 | COL | 385052 | PMPrice | Show | 3/7/20 |
11 | COL | 385052 | Sale | Aggregates | 3/10/20 |
12 | COL | 385052 | Sale | Loss Recovery | 3/18/20 |
13 | PIT | 385662 | OVB | Aggregates | 3/12/20 |
14 | PIT | 385662 | Sale | Loss Recovery | 3/21/20 |
15 | PIT | 385662 | PMPrice | Store | 4/15/20 |
Sample of desired output:
| MKT | custnumber | NoSale | LRSold | Diff |
----------------------------------------------
| PIT | 284747 | 3/7/20 | 3/12/20 | 5 |
| COL | 385052 | 3/7/20 | 3/18/20 | 11 |
| PIT | 385662 | 3/12/20 | 3/21/20 | 9 |
The goal is to visualize how long a time gap (days) exists between Non-sale (NS) records and Loss Recovery - Sale (LR) records.
NS = anything with dsp_id NOT equal to 'Sale' AND src_id NOT equal to 'Loss Recovery' counts as an NS record
LR = anything with dsp_id = 'Sale' AND src_id = 'Loss Recovery'
criteria are:
- MKT: simple text, this will always be consistent between records with the same custnumber
- custnumber: the "key column", always an integer
- NoSale: the most recent NS apptdate prior to a LR apptdate sharing the same custnumber
- LRSale: the most recent LR record grouped by custnumber
- Diff: days between NS and LR
Seems simple in theory, but giving me a hell of a time.
CodePudding user response:
UPDATE:
This has the same problem as the original, where the NoSale is not yet restricted to be less than the LRSold value. Working on that and will update in a few. In the meantime, here's a fiddle with what I have so far, to make it easier for others to play with this:
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=953b2a311e88acc646b02add227f1097
I think this will do it via conditional aggregation, no nested queries needed:
SELECT pi.custnumber, pi.MKT,
MAX(CASE WHEN pi.src_id <> 'Loss Recovery' and pi.dsp_id <> 'Sale' THEN apptdate END) as NoSale,
MAX(CASE WHEN pi.src_id = 'Loss Recovery' and pi.dsp_id = 'Sale' THEN apptdate END) As LRSold,
DATEDIFF(day,
MAX(CASE WHEN pi.src_id <> 'Loss Recovery' and pi.dsp_id <> 'Sale' THEN apptdate END),
MAX(CASE WHEN pi.src_id = 'Loss Recovery' and pi.dsp_id = 'Sale' THEN apptdate END)) As [Diff]
FROM prospectissues pi
WHERE apptdate >= DATEADD(yy, DATEDIFF(yy,0,GETDATE())-3,0)
GROUP BY pi.custnumber, pi.MKT
HAVING MAX(CASE WHEN pi.src_id = 'Loss Recovery' and pi.dsp_id = 'Sale' THEN apptdate ELSE NULL END) IS NOT NULL
AND MAX(CASE WHEN pi.src_id <> 'Loss Recovery' and pi.dsp_id <> 'Sale' THEN apptdate END) IS NOT NUL
If we really want to avoid repeating those long MAX()
functions, we can do one level of nesting:
SELECT custnumber, MKT, NoSale, LRSold, DATEDIFF(day, NoSale, LRSold) [Diff]
FROM (
SELECT pi.custnumber, pi.MKT,
MAX(CASE WHEN pi.src_id <> 'Loss Recovery' and pi.dsp_id <> 'Sale' THEN apptdate END) as NoSale,
MAX(CASE WHEN pi.src_id = 'Loss Recovery' and pi.dsp_id = 'Sale' THEN apptdate END) As LRSold
FROM prospectissues pi
WHERE apptdate >= DATEADD(yy, DATEDIFF(yy,0,GETDATE())-3,0)
GROUP BY pi.custnumber, pi.MKT
HAVING MAX(CASE WHEN pi.src_id = 'Loss Recovery' and pi.dsp_id = 'Sale' THEN apptdate ELSE NULL END) IS NOT NULL
AND MAX(CASE WHEN pi.src_id <> 'Loss Recovery' and pi.dsp_id <> 'Sale' THEN apptdate END) IS NOT NULL
) t
You'll want to check these against your data to make sure the results match expectations, as there's a chance I've misunderstood what some things were supposed to do, especially whether the WHERE
clause limiting to the last three years should also apply to NoSale/LRSold values.
I also asked a question in the comments. Based on the provided sample data, rows 2, 6, 9, and 11 are not part of this calculation anywhere. They don't fit the selection criteria for either column, and so are completely overlooked. Is this by design?
Finally, one thing I noticed is casting appdate
into a DATE value. Given the reliance of the query on this field, this cast is very bad for performance, because it breaks any chance of using an index on the field and the way the query was structed forced the database you to run the conversion **multiple times for every row in the table... even rows that will not be used for the results.
If the column is a datetime
or datetime2
this is easily fixed: don't do the cast. It's not needed for the query to work properly. But if this is a varchar
column, you need to FIX the schema, because it really is BROKEN. The performance issue here is just one of several reasons why, but it is effective at demonstrating why it's so important: just that once change to the column type would likely cut the execution time of the original query by more than an order of magnitude. NEVER use string-typed columns to store time-typed data.
CodePudding user response:
I think this is what you are looking for. It should run quicker by using aggregates like Min
instead of order by
and top 1
.
SELECT a.MKT, a.custnumber,
ns.MinDate AS NoSale, lr.MinDate AS LRSold
DATEDIFF(day, ns.MinDate, lr.MinDate) AS Diff
FROM (SELECT MKT, custnumber, min(apptdate) as MinDate
FROM prospectissues
WHERE src_id <> 'Loss Recovery' AND dsp_id <> 'Sale'
GROUP BY MKT, custnumber) ns
INNER JOIN --should be similar outcome as WHERE LRSold IS NOT NULL
(SELECT MKT, custnumber, min(apptdate) MinDate
FROM prospectissues b
WHERE src_id = 'Loss Recovery' AND dsp_id = 'Sale'
GROUP BY MKT, custnumber) lr
ON ns.custnumber = lr.custnumber AND ns.MKT = lr.MKT
AND ns.apptdate < lr.apptdate --The new part that you want
WHERE CAST(ns.MinDate AS DATE) >= DATEADD(yy, DATEDIFF(yy,0,GETDATE())-3,0)