I've already done a lot of searching, but none of the tips I found gave me the answer I was looking for.
Here is the structure of my df:
year | name | value |
---|---|---|
2000 | Mick | a |
2001 | Mick | a |
2002 | Mick | ab |
2003 | Mick | b |
2000 | Jane | c |
2001 | Jane | c |
2002 | Jane | cd |
2003 | Jane | d |
And a list of values to replace:
values_i_do_not_want = ['ab', 'cd']
I'd like to replace every value listed in values_i_do_not_want with the mode of the value column for the same name in the df['name'] column.
I'd like to end up with this final df:
year | name | value |
---|---|---|
2000 | Mick | a |
2001 | Mick | a |
2002 | Mick | a |
2003 | Mick | b |
2000 | Jane | c |
2001 | Jane | c |
2002 | Jane | c |
2003 | Jane | d |
That is the closest thing I found to what I need, but I couldn't work the condition into the code presented there.
CodePudding user response:
You can use groupby.transform combined with mode to get the most frequent value, then use boolean indexing to replace the values:
df.loc[df['value'].isin(values_i_do_not_want), 'value'] = (
    df.groupby('name')['value']
      .transform(lambda x: x.mode()[0])
)
Output:
year name value
0 2000 Mick a
1 2001 Mick a
2 2002 Mick a
3 2003 Mick b
4 2000 Jane c
5 2001 Jane c
6 2002 Jane c
7 2003 Jane d
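Note that the snippet above computes each name's mode over all rows, unwanted values included. That is fine for the sample data, but if an unwanted value were itself the most frequent one it would be written back unchanged. A minimal sketch of one way around this, assuming you want the mode taken from the allowed rows only (the names unwanted and mode_per_name are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003],
    'name': ['Mick'] * 4 + ['Jane'] * 4,
    'value': ['a', 'a', 'ab', 'b', 'c', 'c', 'cd', 'd'],
})
values_i_do_not_want = ['ab', 'cd']

# Boolean mask of the rows whose value should be replaced.
unwanted = df['value'].isin(values_i_do_not_want)

# Mode per name, computed from the allowed rows only.
# mode() can return several values on a tie; .iloc[0] takes the first one.
mode_per_name = (
    df.loc[~unwanted]
      .groupby('name')['value']
      .agg(lambda s: s.mode().iloc[0])
)

# Look up the mode for each unwanted row by its name and write it back.
df.loc[unwanted, 'value'] = df.loc[unwanted, 'name'].map(mode_per_name)
result = df['value'].tolist()
```

A name whose rows are all unwanted has no entry in mode_per_name, so map would leave NaN there; this sketch does not handle that case.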
CodePudding user response:
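You can try groupby and transform, replacing the unwanted values in each group with that group's mode:
df['value'] = (
    df.groupby('name')['value']
      # mode() returns a Series, so take its first entry as the scalar replacement
      .transform(lambda col: col.replace(values_i_do_not_want, col.mode().iloc[0]))
)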
print(df)
year name value
0 2000 Mick a
1 2001 Mick a
2 2002 Mick a
3 2003 Mick b
4 2000 Jane c
5 2001 Jane c
6 2002 Jane c
7 2003 Jane d
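One detail that matters for the answers here: Series.mode returns a Series rather than a single value, because several values can be equally frequent. A small sketch of that behaviour, using a made-up Series (tied) that is not part of the question's data:

```python
import pandas as pd

# Two values, each appearing twice: a tie for most frequent.
tied = pd.Series(['b', 'a', 'a', 'b'])

modes = tied.mode()         # a Series holding every tied value, sorted
first_mode = modes.iloc[0]  # a single scalar: the first of the tied values
```

So picking the first entry is a choice, not the only possible one; on a tie you get the value that sorts first.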
CodePudding user response:
You can use df.apply to check if the value of each row is in values_i_do_not_want and then find the mode of the values with the same name like this:
import pandas

df = pandas.DataFrame({
    'year': [2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003],
    'name': ['Mick', 'Mick', 'Mick', 'Mick', 'Jane', 'Jane', 'Jane', 'Jane'],
    'value': ['a', 'a', 'ab', 'b', 'c', 'c', 'cd', 'd'],
})
values_i_do_not_want = ['ab', 'cd']

df = df.assign(
    value=df.apply(
        lambda x: (
            # keep allowed values; otherwise take the mode for this row's name
            x.value
            if x.value not in values_i_do_not_want
            else df.loc[df['name'] == x['name'], 'value'].mode().iloc[0]
        ),
        axis=1
    )
)
Output:
>>> df
year name value
0 2000 Mick a
1 2001 Mick a
2 2002 Mick a
3 2003 Mick b
4 2000 Jane c
5 2001 Jane c
6 2002 Jane c
7 2003 Jane d
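The apply version filters the whole frame again for every row that needs replacing, which can get slow on larger data. As a rough sketch of an alternative, the per-name mode can be computed once and then looked up. The helper name replace_with_group_mode and its parameters are hypothetical, not something from pandas:

```python
import pandas as pd

def replace_with_group_mode(df, group_col, value_col, unwanted):
    """Return a copy of df with unwanted values replaced by their group's mode."""
    # One mode per group, computed a single time (first value on a tie).
    modes = df.groupby(group_col)[value_col].agg(lambda s: s.mode().iloc[0])
    is_unwanted = df[value_col].isin(unwanted)
    # Each row's replacement candidate: the mode of the group it belongs to.
    replacement = df[group_col].map(modes)
    # mask() swaps in the replacement only where is_unwanted is True.
    return df.assign(**{value_col: df[value_col].mask(is_unwanted, replacement)})

df = pd.DataFrame({
    'year': [2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003],
    'name': ['Mick', 'Mick', 'Mick', 'Mick', 'Jane', 'Jane', 'Jane', 'Jane'],
    'value': ['a', 'a', 'ab', 'b', 'c', 'c', 'cd', 'd'],
})
cleaned = replace_with_group_mode(df, 'name', 'value', ['ab', 'cd'])
```

Because it goes through assign, the original df is left untouched and the result comes back as a new frame.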