GroupBy pandas DataFrame and fill/update with most frequent values

Time:11-25

I'm trying to find the most frequent value per group in a pandas DataFrame and fill/update every row of that group with it.

Sample Data

import numpy as np
import pandas as pd

test_input = pd.DataFrame(columns=[ 'key',          'value'],
                          data=  [[ 1,              'A'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              np.nan ],
                                  [ 2,              np.nan ],
                                  [ 3,              'C'    ],
                                  [ 3,              np.nan ],
                                  [ 3,              'D'    ],
                                  [ 3,              'D'    ]])
    key value
0   1   A
1   1   B
2   1   B
3   1   NaN
4   2   NaN
5   3   C
6   3   NaN
7   3   D
8   3   D

Get the most frequent value for each key:

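def mode(df, key_cols, value_col, count_col):
    # Count each (key, value) pair, then keep the highest-count row per key.
    # groupby drops NaN values by default, so missing entries are not counted.
    return (df.groupby(key_cols + [value_col]).size()
              .to_frame(count_col).reset_index()
              .sort_values(count_col, ascending=False)
              .drop_duplicates(subset=key_cols))

freq_df = mode(test_input, ['key'], 'value', 'count')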

    key value   count
1   1   B   2
3   3   D   2

How can I fill the original DataFrame with the most frequent value for each key?

Desired Output

    key value
0   1   B
1   1   B
2   1   B
3   1   B
4   2   NaN
5   3   D
6   3   D
7   3   D
8   3   D

CodePudding user response:

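Use GroupBy.transform with a lambda that calls Series.mode. Series.mode ignores NaN, so for a group that contains only missing values it returns an empty Series; the next(iter(...), np.nan) idiom takes the first mode if there is one and falls back to NaN otherwise: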

test_input['value'] = (test_input.groupby('key')['value']
                                 .transform(lambda x: next(iter(x.mode()), np.nan)))
print (test_input)
   key value
0    1     B
1    1     B
2    1     B
3    1     B
4    2   NaN
5    3     D
6    3     D
7    3     D
8    3     D
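
The desired output overwrites every value in a group. If the goal is only to fill the gaps and keep existing values such as 'A' and 'C', one possible variant is to pass the per-group mode to Series.fillna. This is a minimal sketch on the same sample data; the names df, group_mode and filled are illustrative and not part of the original question.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "key":   [1, 1, 1, 1, 2, 3, 3, 3, 3],
    "value": ["A", "B", "B", np.nan, np.nan, "C", np.nan, "D", "D"],
})

# Per-group most frequent value, broadcast back to the original row order.
# Groups with no non-missing values fall back to NaN.
group_mode = df.groupby("key")["value"].transform(
    lambda s: next(iter(s.mode()), np.nan)
)

# Replace only the missing entries; existing values are left untouched.
filled = df["value"].fillna(group_mode)
print(filled.tolist())
```

Here key 2 stays NaN because that group has no observed value to borrow from.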

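Alternative with Series.value_counts, which drops NaN by default and sorts by count in descending order, so the first entry of its index is the most frequent value: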

test_input['value'] = (test_input.groupby('key')['value']
                           .transform(lambda x: next(iter(x.value_counts().index), np.nan)))
print (test_input)
   key value
0    1     B
1    1     B
2    1     B
3    1     B
4    2   NaN
5    3     D
6    3     D
7    3     D
8    3     D
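
If the per-key frequency table from the question is already available, another option is to turn it into a key-to-value lookup and apply it with Series.map. The sketch below rebuilds that table inline for a single key column so that it is self-contained; the names counts, freq and lookup are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "key":   [1, 1, 1, 1, 2, 3, 3, 3, 3],
    "value": ["A", "B", "B", np.nan, np.nan, "C", np.nan, "D", "D"],
})

# Count each (key, value) pair; rows with a NaN value are dropped by groupby.
counts = df.groupby(["key", "value"]).size().reset_index(name="count")

# Keep the highest-count row per key (same idea as the question's mode helper).
freq = (counts.sort_values("count", ascending=False)
              .drop_duplicates(subset="key"))

# Build a key -> most frequent value lookup and map it onto every row.
lookup = freq.set_index("key")["value"]
df["value"] = df["key"].map(lookup)
print(df)
```

Keys that never appear in the lookup, such as key 2 here, map to NaN, which matches the desired output.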