Home > Software engineering >  Iterate over duplicate partitions/groups of a Pandas DataFrame
Iterate over duplicate partitions/groups of a Pandas DataFrame

Time:11-12

I have a df like this

id val1 val2 val3
0   1   1     2
1   1   NaN   2
2   1   4     2
3   1   4     2
4   2   1     1
5   3   NaN   3
6   3   7     3
7   3   7     3

then

temp_df = df.loc[df.duplicated(subset=['val1','val3'], keep=False)]

gives me this

id val1 val2 val3
0   1   1     2
1   1   NaN   2
2   1   4     2
3   1   4     2
5   3   NaN   3
6   3   7     3
7   3   7     3

How can I iterate over each partition/group containing the duplicate values?

for partition in temp_df......:
    print(partition)

id val1 val2 val3
0   1   1     2
1   1   NaN   2
2   1   4     2
3   1   4     2

id val1 val2 val3
5   3   NaN   3
6   3   7     3
7   3   7     3

The goal is to impute the NaN value with the mode of the partition columns. E.g mode(1, 4, 4) = 4 so I want to fill in the NaN value of the first partition with 4. Similarly, I want to fill in the NaN value of the second partition with 7.

CodePudding user response:

Update

Use groupby_apply:

df['val2'] = df.groupby(['val1', 'val3'])['val2'] \
               .apply(lambda x: x.fillna(x.mode().squeeze()))
print(df)

# Output:
   id  val1  val2  val3
0   0     1   1.0     2
1   1     1   4.0     2
2   2     1   4.0     2
3   3     1   4.0     2
4   4     2   1.0     1
5   5     3   7.0     3
6   6     3   7.0     3
7   7     3   7.0     3

Old answer

IIUC, use groupby after sorting dataframe by val2 then fill forward:

df['val2'] = df.sort_values('val2').groupby(['val1', 'val3'])['val2'].ffill()
print(df)

# Output:
   id  val1  val2  val3
0   0     1   1.1   2.2
1   1     1   1.1   2.2
2   3     2   1.3   1.0
3   4     3   1.5   6.2
4   5     3   1.5   6.2
  • Related