Home > OS >  Pandas: compare df and add missing rows
Pandas: compare df and add missing rows

Time:09-23

I have a list of dataframes which have 1 column in common ('label'). However, in some of the dataframes some rows are missing. Example: df1 = pd.DataFrame([['sample1',2,3], ['sample4',7,8]], columns=['label', 'B', 'E'], index=[1,2]) df2 = pd.DataFrame([['sample1',20,30], ['sample2',70,80], ['sample3',700,800]], columns=['label', 'B', 'C'], index=[2,3,4]) I would like to add rows, so the length of the dfs are the same but preserving the right order. The desired output would be:

     label  B  E
1  sample1  2  3
2        0  0  0
3        0  0  0
4  sample4  7  8

     label    B    C
1  sample1   20   30
2  sample2   70   80
3  sample3  700  800
4  0          0    0

I was looking into pandas three-way joining multiple dataframes on columns but I don't want to merge my dataframes. And pandas align() function : illustrative example doesn't give the desired output either. I was also thinking about comparing the 'label' column with a list and loop through to add the missing rows. If somebody could point me into the right direction, that would be great.

CodePudding user response:

You can get the common indices in the desired order, then reindex:

# here the order matters to get the preference
# for a sorted order use:
# unique = sorted(pd.concat([df1['label'], df2['label']]).unique())
unique = pd.concat([df2['label'], df1['label']]).unique()

out1 = (df1.set_axis(df1['label'])
           .reindex(unique, fill_value=0)
           .reset_index(drop=True)
       )

out2 = (df2.set_axis(df2['label'])
           .reindex(unique, fill_value=0)
           .reset_index(drop=True)
       )

outputs:

# out1
     label  B  E
0  sample1  2  3
1        0  0  0
2        0  0  0
3  sample4  7  8

# out2
     label    B    C
0  sample1   20   30
1  sample2   70   80
2  sample3  700  800
3        0    0    0
  • Related