I have a data frame like that
week | Revenue | Cohort_index |
---|---|---|
19/09/2021 | 120 | 0 |
19/09/2021 | 150 | 1 |
19/09/2021 | 223 | 2 |
19/09/2021 | 256 | 4 |
20/09/2021 | 340 | 0 |
20/09/2021 | 126 | 1 |
20/09/2021 | 234 | 2 |
Now I'd like to check if cohort_index
is missing (3 in this case for date 19/09/2021 & also 3 for 20/09/2021) , then insert a new row with the missing index.
Maximum number of indices will be decreasing so let's say for 19/09/2021 the maximum number of cohort index is 4 , so the next date 20/09/2021 will have 3 indices.. and I need to fill all missing indices from the minimum to maximum
rest of column values are copied from the previous row except Revenue
that will be filled with 0
while updating the data frame index.
Data is more granular than what I have posted , so for every date and every cohort_index I have different countries and different device types.
Desired Output :
week | Revenue | Cohort_index |
---|---|---|
19/09/2021 | 120 | 0 |
19/09/2021 | 150 | 1 |
19/09/2021 | 223 | 2 |
19/09/2021 | 0 | 3 |
19/09/2021 | 256 | 4 |
20/09/2021 | 340 | 0 |
20/09/2021 | 126 | 1 |
20/09/2021 | 234 | 2 |
20/09/2021 | 0 | 3 |
I think a For loop
is needed , but I can't get my head around it.
CodePudding user response:
A bit more tricky than I'd like but it should work. Could be simplified a little if there was a way to get current group number in groupby apply, is there?
columns = df.columns
max_cohort = 4
df["ngroup"] = df.groupby('week').ngroup()
out = (
df.groupby('week', as_index=False)
.apply(lambda g: g.set_index('Cohort_index')
.reindex(range(max_cohort 1 - max(g["ngroup"])))
)).droplevel(0).reset_index()[columns]
out["week"] = out["week"].fillna(method="ffill")
out["Revenue"] = out["Revenue"].fillna(0)
results is
week Revenue Cohort_index
0 19/09/2021 120.0 0
1 19/09/2021 150.0 1
2 19/09/2021 223.0 2
3 19/09/2021 0.0 3
4 19/09/2021 256.0 4
5 20/09/2021 340.0 0
6 20/09/2021 126.0 1
7 20/09/2021 234.0 2
8 20/09/2021 0.0 3