Fill new rows based on missing column values-CodePudding

I have a data frame like that

week	co_week	Revenue	Cohort_index
19/09/2021	01/10/2021	120	0
19/09/2021	03/10/2021	150	1
19/09/2021	06/10/2021	223	2
19/09/2021	07/10/2021	256	4
19/09/2021	08/10/2021	340	5
20/09/2021	06/10/2021	126	0
20/09/2021	07/10/2021	234	1

Now I'd like to check if one cohort_index is missing (3 in this case) , then insert a new row with the missing index , rest of column values are copied from the previous row while updating the data frame index.

Desired Output :

week	co_week	Revenue	Cohort_index
19/09/2021	01/10/2021	120	0
19/09/2021	03/10/2021	150	1
19/09/2021	06/10/2021	223	2
19/09/2021	06/10/2021	223	3
19/09/2021	07/10/2021	256	4
19/09/2021	08/10/2021	340	5
20/09/2021	06/10/2021	126	0
20/09/2021	07/10/2021	234	1

I can't hard-code the new raw since the data is huge!

new_raw = DataFrame({"week": 19/09/2022, "co_week": 06/10/2021, "Revenue": 223 ,"Cohort_index":3})
df = df.append(new_raw, ignore_index=False)

CodePudding user response：

You can set_index temporarily, reindex with ffill, then reset_index and restore columns in order:

out = (df
  .set_index('Cohort_index')
  .reindex(range(df['Cohort_index'].max() 1), method='ffill')
  .reset_index()[df.columns]
)

output:

         week     co_week  Revenue  Cohort_index
0  19/09/2021  01/10/2021    120.0             0
1  19/09/2021  03/10/2021    150.0             1
2  19/09/2021  06/10/2021    223.0             2
3  19/09/2021  06/10/2021    223.0             3
4  19/09/2021  07/10/2021    256.0             4
5  19/09/2021  08/10/2021    340.0             5

per group

You can do the same in a groupby

out = (df
   .groupby('week', as_index=False)
   .apply(lambda g: g.set_index('Cohort_index')
                     .reindex(range(g['Cohort_index'].max() 1), method='ffill')
                     .reset_index()
          )
   .droplevel(0)[df.columns]
)

output:

         week     co_week  Revenue  Cohort_index
0  19/09/2021  01/10/2021      120             0
1  19/09/2021  03/10/2021      150             1
2  19/09/2021  06/10/2021      223             2
3  19/09/2021  06/10/2021      223             3
4  19/09/2021  07/10/2021      256             4
5  19/09/2021  08/10/2021      340             5
0  20/09/2021  06/10/2021      126             0
1  20/09/2021  07/10/2021      234             1

CodePudding user response：

Use DataFrame.reindex with method='ffill' parameter after convert Cohort_index to index by DataFrame.set_index, last convert index to columns and set original ordering by DataFrame.reindex:

df = (df.set_index('Cohort_index')
        .reindex(range(df['Cohort_index'].max()   1), method='ffill')
        .reset_index()
        .reindex(df.columns, axis=1))
print (df)
         week     co_week  Revenue  Cohort_index
0  19/09/2021  01/10/2021      120             0
1  19/09/2021  03/10/2021      150             1
2  19/09/2021  06/10/2021      223             2
3  19/09/2021  06/10/2021      223             3
4  19/09/2021  07/10/2021      256             4
5  19/09/2021  08/10/2021      340             5

EDIT: Solution working per groups created if difference is less like 0:

g = df['Cohort_index'].diff().lt(0).cumsum()

df = (df.set_index(['Cohort_index'])
        .groupby(g.tolist())
        .apply(lambda x: x.reindex(range(x.index.max()   1), method='ffill'))
        .droplevel(0)
        .reset_index()
        .reindex(df.columns, axis=1)
        )
print (df)
         week     co_week  Revenue  Cohort_index
0  19/09/2021  01/10/2021      120             0
1  19/09/2021  03/10/2021      150             1
2  19/09/2021  06/10/2021      223             2
3  19/09/2021  06/10/2021      223             3
4  19/09/2021  07/10/2021      256             4
5  19/09/2021  08/10/2021      340             5
6  20/09/2021  06/10/2021      126             0
7  20/09/2021  07/10/2021      234             1