I'm looking to fill in a dataframe with a missing row based on a few criteria.
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 keyA 39686 non-null object
1 keyB 39686 non-null int64
2 keyC 39686 non-null object
3 keyD 39686 non-null object
4 snapshot_week 39686 non-null datetime64[ns]
5 metric1 39686 non-null int64
6 metric2 39686 non-null int64
dtypes: datetime64[ns](1), int64(1), object(5)
A1/B1/C1/D1 has data missing for 2022-08-20, and A3/B3/C3/D3 is missing 2022-08-27.
[['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587]]
I'd like to update the data frame with the following extra rows, using 0 as the default value for metric1/metric2.
[['A1','B1','C1','D1','2022-08-20',0,0], # <---
['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587],
['A3','B3','C3','D3','2022-08-27',0,0]] # <--
I've tried a few different techniques such as reindex
, asfreq
, groupby
but I have yet to achieve the desired results. The snapshot_week values will only be on a Saturday, and only two dates will ever be present at a given time. Not all key permutations are needed, and some keys are numeric identifiers. Essentially, I just need to have the week over week record for the key combination with metrics defaulting to 0.
Thanks in advance for the support!
CodePudding user response:
Try as follows. First, we use df.pivot
to "group" all values (per each week) for each key*
-sequence (as index). Next, we use df.stack
with dropna
parameter set to False
to get rows for all key*
-sequences per week. Finally, we reset the index and replace the NaNs with zeros.
import pandas as pd
data = [['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587]]
cols = ['keyA','keyB','keyC','keyD','snapshot_week', 'metric1', 'metric2']
df = pd.DataFrame(data, columns=cols)
df_new = df.pivot(index=['keyA','keyB','keyC','keyD'],
columns=['snapshot_week'],
values=['metric1','metric2'])\
.stack(dropna=False).reset_index(drop=False).fillna(0)
print(df_new)
keyA keyB keyC keyD snapshot_week metric1 metric2
0 A1 B1 C1 D1 2022-08-20 0.0 0.0
1 A1 B1 C1 D1 2022-08-27 5000.0 5000.0
2 A2 B2 C2 D2 2022-08-20 4278.0 4278.0
3 A2 B2 C2 D2 2022-08-27 6852.0 6852.0
4 A3 B3 C3 D3 2022-08-20 9587.0 9587.0
5 A3 B3 C3 D3 2022-08-27 0.0 0.0