SQL - applying two different conditions after group by to select rows from sql table-CodePudding

I am willing to filter the rows based on the group of client id, date

So in the group if the latest status as per update date = 'CO' then earliest rows as per update date and if the latest status in ('NonPay','VD','Active' then the latest row as per update date. Table look like: table1

rownum	clientid	date	status	updateDate
1	1234	2021-02-01	CO	2021-02-01
2	1234	2021-02-01	CO	2021-01-01
3	1234	2021-02-01	NonPay	2020-12-01
4	1234	2021-02-03	Active	2021-11-01
5	1234	2021-02-03	CO	2021-10-01
6	1234	2021-02-03	CO	2021-09-01
7	1234	2021-02-04	CO	2021-08-01
8	1234	2021-02-04	VD	2021-07-01
9	4567	2019-06-01	Active	2020-12-28
10	4567	2019-06-01	CO	2020-12-20
11	4567	2019-06-01	NonPay	2020-12-10
12	4567	2019-05-03	VD	2020-12-01
13	4567	2019-05-03	Active	2020-11-01
14	4567	2019-05-03	CO	2020-10-01
15	4567	2019-05-03	NP	2020-09-01
16	4567	2019-04-04	CO	2020-08-01
17	4567	2019-04-04	VD	2020-07-01

So the expected result would look like :

rownum	clientid	date	status	updateDate
3	1234	2021-02-01	NonPay	2020-12-01
4	1234	2021-02-03	Active	2021-11-01
8	1234	2021-02-04	VD	2021-07-01
9	4567	2019-06-01	Active	2020-02-01
12	4567	2019-05-03	VD	2020-12-01
17	4567	2019-04-04	VD	2020-07-01

I tried :

select *,
case when rank_date=1 and status=!='Active' then max(rank_date)
else min(rank_date) end as selected_rank_date from (
select *, rank() over(partition by clientid, date order by updateDate desc) as rank_date from table1)

And on top of this I will compare the rank_date and selected_rank_date, wherever they are equal, I will select those rows. But unfortunately I am not able to figure out the first query itself, trying since a week.

If there's python way of doing then it should be optimized as the table size is huge approx. to 1 billion records.

CodePudding user response：

#Import the csv in df and try with below code...
grp = df.groupby(['clientid', 'date'], axis=0)
li = []
for i, j in grp:
    j.sort_values(by=['updateDate'], ascending = True)
    fil = j['status'] != 'CO'
    j = j.loc[fil, :].reset_index(drop=True)
    li.append(j.loc[0,:])
pd.DataFrame(li)

CodePudding user response：

One approach to solve your problem in MySQL is the following:

Step 1: get the first and the last updateDate for each partition (client_id, date)
Step 2: get the last updateDate for the group ('NonPay','VD','Active')
Step 3: get the first updateDate for the group ('CO')
Step 4: do a union of the rows for the two groups

Step 1: you can use ROW_NUMBER():

ascendently over updateDate to find first date of the partition where this value equals 1
descendently over updateDate to find last date of the partition where this value equals 1

SELECT *,
       ROW_NUMBER() OVER(PARTITION BY clientid, date
                         ORDER     BY updateDate     ) AS firstUpdateDate,
       ROW_NUMBER() OVER(PARTITION BY clientid, date
                         ORDER     BY updateDate DESC) AS lastUpdateDate
FROM tab

Step 2: when you have one of the statuses ('NonPay','VD','Active') in the last date, you require to retrieve the last date itself, hence getting the corresponding rows means effectively to retrieve the last rows (lastUpdateDate = 1) where the status is one of the previously cited ones.

SELECT rd.rownum, 
       rd.clientid, 
       rd.date, 
       rd.status, 
       rd.updateDate
FROM ranked_dates rd
WHERE rd.lastUpdateDate = 1
  AND rd.status IN ('NonPay', 'VD', 'Active')

Step 3: when you have one status 'CO' in the last date, you require to retrieve the first date, or in other words, from all the first dates that we have, we don't want those rows whose combination of (clientid, date) has already been captured from the Step 2. You can do this with an left join where the left table values are null (those values for which you don't have correspondence in the table generated by Step 2).

SELECT rd.rownum, 
       rd.clientid, 
       rd.date, 
       rd.status, 
       rd.updateDate
FROM       ranked_dates rd
LEFT JOIN np_vd_active_status s 
       ON rd.clientid = s.clientid
      AND rd.date = s.date
WHERE rd.firstUpdateDate = 1
  AND s.rownum IS NULL

Step 4: just apply a union between Step 2 result and Step 3 result. If you want to make some ordering on the rownum field, you can do that easily with an ORDER BY statement.

Final Query:

WITH ranked_dates AS (
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY clientid, date
                             ORDER     BY updateDate     ) AS firstUpdateDate,
           ROW_NUMBER() OVER(PARTITION BY clientid, date
                             ORDER     BY updateDate DESC) AS lastUpdateDate
    FROM tab  
), np_vd_active_status AS (
    SELECT rd.rownum, 
           rd.clientid, 
           rd.date, 
           rd.status, 
           rd.updateDate
    FROM ranked_dates rd
    WHERE rd.lastUpdateDate = 1
      AND rd.status IN ('NonPay', 'VD', 'Active')
)
SELECT rd.rownum, 
       rd.clientid, 
       rd.date, 
       rd.status, 
       rd.updateDate
FROM       ranked_dates rd
LEFT JOIN np_vd_active_status s 
       ON rd.clientid = s.clientid
      AND rd.date = s.date
WHERE rd.firstUpdateDate = 1
  AND s.rownum IS NULL

UNION

SELECT * 
FROM np_vd_active_status

Try it here.