I am willing to filter the rows based on the group of client id, date
So in the group if the latest status as per update date = 'CO' then earliest rows as per update date and if the latest status in ('NonPay','VD','Active' then the latest row as per update date. Table look like: table1
rownum | clientid | date | status | updateDate |
---|---|---|---|---|
1 | 1234 | 2021-02-01 | CO | 2021-02-01 |
2 | 1234 | 2021-02-01 | CO | 2021-01-01 |
3 | 1234 | 2021-02-01 | NonPay | 2020-12-01 |
4 | 1234 | 2021-02-03 | Active | 2021-11-01 |
5 | 1234 | 2021-02-03 | CO | 2021-10-01 |
6 | 1234 | 2021-02-03 | CO | 2021-09-01 |
7 | 1234 | 2021-02-04 | CO | 2021-08-01 |
8 | 1234 | 2021-02-04 | VD | 2021-07-01 |
9 | 4567 | 2019-06-01 | Active | 2020-12-28 |
10 | 4567 | 2019-06-01 | CO | 2020-12-20 |
11 | 4567 | 2019-06-01 | NonPay | 2020-12-10 |
12 | 4567 | 2019-05-03 | VD | 2020-12-01 |
13 | 4567 | 2019-05-03 | Active | 2020-11-01 |
14 | 4567 | 2019-05-03 | CO | 2020-10-01 |
15 | 4567 | 2019-05-03 | NP | 2020-09-01 |
16 | 4567 | 2019-04-04 | CO | 2020-08-01 |
17 | 4567 | 2019-04-04 | VD | 2020-07-01 |
So the expected result would look like :
rownum | clientid | date | status | updateDate |
---|---|---|---|---|
3 | 1234 | 2021-02-01 | NonPay | 2020-12-01 |
4 | 1234 | 2021-02-03 | Active | 2021-11-01 |
8 | 1234 | 2021-02-04 | VD | 2021-07-01 |
9 | 4567 | 2019-06-01 | Active | 2020-02-01 |
12 | 4567 | 2019-05-03 | VD | 2020-12-01 |
17 | 4567 | 2019-04-04 | VD | 2020-07-01 |
I tried :
select *,
case when rank_date=1 and status=!='Active' then max(rank_date)
else min(rank_date) end as selected_rank_date from (
select *, rank() over(partition by clientid, date order by updateDate desc) as rank_date from table1)
And on top of this I will compare the rank_date and selected_rank_date, wherever they are equal, I will select those rows. But unfortunately I am not able to figure out the first query itself, trying since a week.
If there's python way of doing then it should be optimized as the table size is huge approx. to 1 billion records.
CodePudding user response:
#Import the csv in df and try with below code...
grp = df.groupby(['clientid', 'date'], axis=0)
li = []
for i, j in grp:
j.sort_values(by=['updateDate'], ascending = True)
fil = j['status'] != 'CO'
j = j.loc[fil, :].reset_index(drop=True)
li.append(j.loc[0,:])
pd.DataFrame(li)
CodePudding user response:
One approach to solve your problem in MySQL is the following:
- Step 1: get the first and the last
updateDate
for each partition(client_id, date)
- Step 2: get the last
updateDate
for the group('NonPay','VD','Active')
- Step 3: get the first
updateDate
for the group('CO')
- Step 4: do a union of the rows for the two groups
Step 1: you can use ROW_NUMBER()
:
- ascendently over
updateDate
to find first date of the partition where this value equals 1 - descendently over
updateDate
to find last date of the partition where this value equals 1
SELECT *,
ROW_NUMBER() OVER(PARTITION BY clientid, date
ORDER BY updateDate ) AS firstUpdateDate,
ROW_NUMBER() OVER(PARTITION BY clientid, date
ORDER BY updateDate DESC) AS lastUpdateDate
FROM tab
Step 2: when you have one of the statuses ('NonPay','VD','Active')
in the last date, you require to retrieve the last date itself, hence getting the corresponding rows means effectively to retrieve the last rows (lastUpdateDate = 1
) where the status is one of the previously cited ones.
SELECT rd.rownum,
rd.clientid,
rd.date,
rd.status,
rd.updateDate
FROM ranked_dates rd
WHERE rd.lastUpdateDate = 1
AND rd.status IN ('NonPay', 'VD', 'Active')
Step 3: when you have one status 'CO'
in the last date, you require to retrieve the first date, or in other words, from all the first dates that we have, we don't want those rows whose combination of (clientid
, date
) has already been captured from the Step 2. You can do this with an left join where the left table values are null (those values for which you don't have correspondence in the table generated by Step 2).
SELECT rd.rownum,
rd.clientid,
rd.date,
rd.status,
rd.updateDate
FROM ranked_dates rd
LEFT JOIN np_vd_active_status s
ON rd.clientid = s.clientid
AND rd.date = s.date
WHERE rd.firstUpdateDate = 1
AND s.rownum IS NULL
Step 4: just apply a union between Step 2 result and Step 3 result. If you want to make some ordering on the rownum
field, you can do that easily with an ORDER BY
statement.
Final Query:
WITH ranked_dates AS (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY clientid, date
ORDER BY updateDate ) AS firstUpdateDate,
ROW_NUMBER() OVER(PARTITION BY clientid, date
ORDER BY updateDate DESC) AS lastUpdateDate
FROM tab
), np_vd_active_status AS (
SELECT rd.rownum,
rd.clientid,
rd.date,
rd.status,
rd.updateDate
FROM ranked_dates rd
WHERE rd.lastUpdateDate = 1
AND rd.status IN ('NonPay', 'VD', 'Active')
)
SELECT rd.rownum,
rd.clientid,
rd.date,
rd.status,
rd.updateDate
FROM ranked_dates rd
LEFT JOIN np_vd_active_status s
ON rd.clientid = s.clientid
AND rd.date = s.date
WHERE rd.firstUpdateDate = 1
AND s.rownum IS NULL
UNION
SELECT *
FROM np_vd_active_status
Try it here.