I have a data frame like that
week | co_week | Revenue | Cohort_index |
---|---|---|---|
19/09/2021 | 01/10/2021 | 120 | 0 |
19/09/2021 | 03/10/2021 | 150 | 1 |
19/09/2021 | 06/10/2021 | 223 | 2 |
19/09/2021 | 07/10/2021 | 256 | 4 |
19/09/2021 | 08/10/2021 | 340 | 5 |
20/09/2021 | 06/10/2021 | 126 | 0 |
20/09/2021 | 07/10/2021 | 234 | 1 |
Now I'd like to check if one cohort_index
is missing (3 in this case) , then insert a new row with the missing index , rest of column values are copied from the previous row while updating the data frame index.
Desired Output :
week | co_week | Revenue | Cohort_index |
---|---|---|---|
19/09/2021 | 01/10/2021 | 120 | 0 |
19/09/2021 | 03/10/2021 | 150 | 1 |
19/09/2021 | 06/10/2021 | 223 | 2 |
19/09/2021 | 06/10/2021 | 223 | 3 |
19/09/2021 | 07/10/2021 | 256 | 4 |
19/09/2021 | 08/10/2021 | 340 | 5 |
20/09/2021 | 06/10/2021 | 126 | 0 |
20/09/2021 | 07/10/2021 | 234 | 1 |
I can't hard-code the new raw since the data is huge!
new_raw = DataFrame({"week": 19/09/2022, "co_week": 06/10/2021, "Revenue": 223 ,"Cohort_index":3})
df = df.append(new_raw, ignore_index=False)
CodePudding user response:
You can set_index
temporarily, reindex
with ffill
, then reset_index
and restore columns in order:
out = (df
.set_index('Cohort_index')
.reindex(range(df['Cohort_index'].max() 1), method='ffill')
.reset_index()[df.columns]
)
output:
week co_week Revenue Cohort_index
0 19/09/2021 01/10/2021 120.0 0
1 19/09/2021 03/10/2021 150.0 1
2 19/09/2021 06/10/2021 223.0 2
3 19/09/2021 06/10/2021 223.0 3
4 19/09/2021 07/10/2021 256.0 4
5 19/09/2021 08/10/2021 340.0 5
per group
You can do the same in a groupby
out = (df
.groupby('week', as_index=False)
.apply(lambda g: g.set_index('Cohort_index')
.reindex(range(g['Cohort_index'].max() 1), method='ffill')
.reset_index()
)
.droplevel(0)[df.columns]
)
output:
week co_week Revenue Cohort_index
0 19/09/2021 01/10/2021 120 0
1 19/09/2021 03/10/2021 150 1
2 19/09/2021 06/10/2021 223 2
3 19/09/2021 06/10/2021 223 3
4 19/09/2021 07/10/2021 256 4
5 19/09/2021 08/10/2021 340 5
0 20/09/2021 06/10/2021 126 0
1 20/09/2021 07/10/2021 234 1
CodePudding user response:
Use DataFrame.reindex
with method='ffill'
parameter after convert Cohort_index
to index by DataFrame.set_index
, last convert index
to columns and set original ordering by DataFrame.reindex
:
df = (df.set_index('Cohort_index')
.reindex(range(df['Cohort_index'].max() 1), method='ffill')
.reset_index()
.reindex(df.columns, axis=1))
print (df)
week co_week Revenue Cohort_index
0 19/09/2021 01/10/2021 120 0
1 19/09/2021 03/10/2021 150 1
2 19/09/2021 06/10/2021 223 2
3 19/09/2021 06/10/2021 223 3
4 19/09/2021 07/10/2021 256 4
5 19/09/2021 08/10/2021 340 5
EDIT: Solution working per groups created if difference is less like 0
:
g = df['Cohort_index'].diff().lt(0).cumsum()
df = (df.set_index(['Cohort_index'])
.groupby(g.tolist())
.apply(lambda x: x.reindex(range(x.index.max() 1), method='ffill'))
.droplevel(0)
.reset_index()
.reindex(df.columns, axis=1)
)
print (df)
week co_week Revenue Cohort_index
0 19/09/2021 01/10/2021 120 0
1 19/09/2021 03/10/2021 150 1
2 19/09/2021 06/10/2021 223 2
3 19/09/2021 06/10/2021 223 3
4 19/09/2021 07/10/2021 256 4
5 19/09/2021 08/10/2021 340 5
6 20/09/2021 06/10/2021 126 0
7 20/09/2021 07/10/2021 234 1