I'm trying to get the most frequent value per key in a pandas DataFrame and fill/update the original data with that value.
Sample Data
import numpy as np
import pandas as pd
test_input = pd.DataFrame(columns=['key', 'value'],
                          data=[[1, 'A'],
                                [1, 'B'],
                                [1, 'B'],
                                [1, np.nan],
                                [2, np.nan],
                                [3, 'C'],
                                [3, np.nan],
                                [3, 'D'],
                                [3, 'D']])
   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3   NaN
7    3     D
8    3     D
Get the most frequent value per key
def mode(df, key_cols, value_col, count_col):
    # Count each (key, value) combination, then keep the most frequent
    # value per key by sorting on the count and dropping duplicate keys.
    return (df.groupby(key_cols + [value_col]).size()
              .to_frame(count_col).reset_index()
              .sort_values(count_col, ascending=False)
              .drop_duplicates(subset=key_cols))
freq_df = mode(test_input, ['key'], 'value', 'count')
   key value  count
1    1     B      2
3    3     D      2
How can I fill the most frequent values back into the original dataframe?
Desired Output
   key value
0    1     B
1    1     B
2    1     B
3    1     B
4    2   NaN
5    3     D
6    3     D
7    3     D
8    3     D
CodePudding user response:
Use GroupBy.transform with a custom lambda function that calls Series.mode, plus the next/iter trick to fall back to NaN when the mode is empty (because the group contains only missing values):
test_input['value'] = (test_input.groupby('key')['value']
                                 .transform(lambda x: next(iter(x.mode()), np.nan)))
print(test_input)
   key value
0    1     B
1    1     B
2    1     B
3    1     B
4    2   NaN
5    3     D
6    3     D
7    3     D
8    3     D
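To illustrate why the next(iter(...), np.nan) fallback is needed, here is a minimal standalone sketch (the all_nan series below is hypothetical throwaway data, not part of the question): Series.mode() drops missing values by default, so a group that contains only NaN, like key 2, yields an empty Series, and next(..., np.nan) returns the fallback instead of raising StopIteration.
# Minimal sketch of the empty-mode case (e.g. the group for key 2)
all_nan = pd.Series([np.nan, np.nan], dtype=object)
print(all_nan.mode())                      # Series([], dtype: object) - NaN is dropped
print(next(iter(all_nan.mode()), np.nan))  # nan - the default value avoids StopIteration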
Solution with Series.value_counts:
test_input['value'] = (test_input.groupby('key')['value']
                                 .transform(lambda x: next(iter(x.value_counts().index), np.nan)))
print(test_input)
   key value
0    1     B
1    1     B
2    1     B
3    1     B
4    2   NaN
5    3     D
6    3     D
7    3     D
8    3     D
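As a side note, if freq_df from the question has already been built (a sketch assuming the corrected mode function above), another option is to map it back onto the original frame by key; key 2 has no row in freq_df because all of its values are NaN, so Series.map leaves it as NaN, matching the desired output.
# Sketch: reuse freq_df and map the per-key mode back onto the original frame.
# Keys missing from freq_df (all-NaN groups such as key 2) map to NaN.
test_input['value'] = test_input['key'].map(freq_df.set_index('key')['value'])
print(test_input)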