I have a df like this
id val1 val2 val3
0 1 1 2
1 1 NaN 2
2 1 4 2
3 1 4 2
4 2 1 1
5 3 NaN 3
6 3 7 3
7 3 7 3
then
temp_df = df.loc[df.duplicated(subset=['val1','val3'], keep=False)]
gives me this
id val1 val2 val3
0 1 1 2
1 1 NaN 2
2 1 4 2
3 1 4 2
5 3 NaN 3
6 3 7 3
7 3 7 3
How can I iterate over each partition/group containing the duplicate values?
for partition in temp_df......:
print(partition)
id val1 val2 val3
0 1 1 2
1 1 NaN 2
2 1 4 2
3 1 4 2
id val1 val2 val3
5 3 NaN 3
6 3 7 3
7 3 7 3
The goal is to impute the NaN value with the mode of the partition columns. E.g mode(1, 4, 4) = 4
so I want to fill in the NaN value of the first partition with 4. Similarly, I want to fill in the NaN value of the second partition with 7.
CodePudding user response:
Update
Use groupby_apply
:
df['val2'] = df.groupby(['val1', 'val3'])['val2'] \
.apply(lambda x: x.fillna(x.mode().squeeze()))
print(df)
# Output:
id val1 val2 val3
0 0 1 1.0 2
1 1 1 4.0 2
2 2 1 4.0 2
3 3 1 4.0 2
4 4 2 1.0 1
5 5 3 7.0 3
6 6 3 7.0 3
7 7 3 7.0 3
Old answer
IIUC, use groupby
after sorting dataframe by val2
then fill forward:
df['val2'] = df.sort_values('val2').groupby(['val1', 'val3'])['val2'].ffill()
print(df)
# Output:
id val1 val2 val3
0 0 1 1.1 2.2
1 1 1 1.1 2.2
2 3 2 1.3 1.0
3 4 3 1.5 6.2
4 5 3 1.5 6.2