Conditional replacing with mode value

I have already done a lot of searching, but none of the tips I found gave the answer I was looking for.

Here is my df structure.

year	name	value
2000	Mick	a
2001	Mick	a
2002	Mick	ab
2003	Mick	b
2000	Jane	c
2001	Jane	c
2002	Jane	cd
2003	Jane	d

And a list of values to replace:

values_i_do_not_want = ['ab', 'cd']

I'd like to replace every value listed in values_i_do_not_want with the mode (most frequent value) for the corresponding name in the df['name'] column.

I'd like to end up with this final df:

year	name	value
2000	Mick	a
2001	Mick	a
2002	Mick	a
2003	Mick	b
2000	Jane	c
2001	Jane	c
2002	Jane	c
2003	Jane	d

This is the closest thing I found to what I need, but I couldn't work the condition into the code presented there.

CodePudding user response：

You can use groupby.transform combined with mode to get the most frequent value, then use boolean indexing to replace the values:

df.loc[df['value'].isin(values_i_do_not_want),
       'value'] = (df.groupby('name')['value']
                     .transform(lambda x: x.mode()[0])
                   )

Output:

   year  name value
0  2000  Mick     a
1  2001  Mick     a
2  2002  Mick     a
3  2003  Mick     b
4  2000  Jane     c
5  2001  Jane     c
6  2002  Jane     c
7  2003  Jane     d
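
For anyone who wants to run the snippet above from scratch, here is a minimal self-contained sketch. The DataFrame construction is reconstructed from the table in the question, and the intermediate names group_mode and mask are only illustrative; they are not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    'year': [2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003],
    'name': ['Mick'] * 4 + ['Jane'] * 4,
    'value': ['a', 'a', 'ab', 'b', 'c', 'c', 'cd', 'd'],
})
values_i_do_not_want = ['ab', 'cd']

# Most frequent value per name, broadcast back onto the original row index.
# mode() returns a Series (ties are possible), so [0] picks the first mode.
group_mode = df.groupby('name')['value'].transform(lambda x: x.mode()[0])

# Boolean mask of the rows that hold an unwanted value.
mask = df['value'].isin(values_i_do_not_want)

# Overwrite only the masked rows; the other rows are left untouched.
df.loc[mask, 'value'] = group_mode[mask]

print(df)
```

Splitting the one-liner into group_mode and mask makes it easier to inspect each step, but the result should be the same as in the answer above.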

CodePudding user response：

You can try groupby and transform, replacing the unwanted values inside each group with that group's mode:

# Within each name group, replace the unwanted values with the group's first mode.
df['value'] = (df.groupby('name')['value']
               .transform(lambda col: col.replace(values_i_do_not_want, col.mode()[0])))

print(df)

   year  name value
0  2000  Mick     a
1  2001  Mick     a
2  2002  Mick     a
3  2003  Mick     b
4  2000  Jane     c
5  2001  Jane     c
6  2002  Jane     c
7  2003  Jane     d
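
One caveat worth checking: the approaches above compute the mode over every value in the group, including the unwanted ones. If an unwanted value happens to be the most frequent one for a name, it would be replaced by itself. The following is a hedged sketch of one way around that, assuming it is acceptable to ignore the unwanted values when computing the mode. The small example frame and the names cleaned and group_mode are made up for illustration.

```python
import pandas as pd

# Hypothetical data where the unwanted value 'ab' is itself the most frequent one.
df = pd.DataFrame({
    'name': ['Mick', 'Mick', 'Mick', 'Mick'],
    'value': ['ab', 'ab', 'a', 'b'],
})
values_i_do_not_want = ['ab']

# Blank out the unwanted values first; mode() skips missing values by default,
# so they can no longer win.
cleaned = df['value'].mask(df['value'].isin(values_i_do_not_want))

# Per-name mode of the remaining values. Ties are returned sorted, so [0]
# picks the smallest one. A group consisting only of unwanted values would
# have an empty mode and raise here.
group_mode = cleaned.groupby(df['name']).transform(lambda x: x.mode()[0])

# Fill the blanked rows with the mode of their group.
df['value'] = cleaned.fillna(group_mode)

print(df)
```

Whether the unwanted values should count toward the mode depends on the data; for the sample in the question both variants give the same result.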

CodePudding user response：

You can use df.apply to check if the value of each row is in values_i_do_not_want and then find the mode of the values with the same name like this:

import pandas

df = pandas.DataFrame({
    'year': [2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003],
    'name': ['Mick', 'Mick', 'Mick', 'Mick', 'Jane', 'Jane', 'Jane', 'Jane'],
    'value': ['a', 'a', 'ab', 'b', 'c', 'c', 'cd', 'd'],
})

values_i_do_not_want = ['ab', 'cd']

df = df.assign(
    value=df.apply(
        lambda x: (
            x.value 
            if x.value not in values_i_do_not_want 
            else df.loc[df['name'] == x['name'], 'value'].mode().iloc[0]
        ),
        axis=1
    )
)

Output:

>>> df
   year  name value
0  2000  Mick     a
1  2001  Mick     a
2  2002  Mick     a
3  2003  Mick     b
4  2000  Jane     c
5  2001  Jane     c
6  2002  Jane     c
7  2003  Jane     d
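
Note that the row-wise apply above filters the frame and recomputes the mode for every row that needs replacing, which may become slow on larger data. A possible variation, sketched here under the same sample data, computes the mode once per name and looks it up with map. The name mode_by_name is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003],
    'name': ['Mick', 'Mick', 'Mick', 'Mick', 'Jane', 'Jane', 'Jane', 'Jane'],
    'value': ['a', 'a', 'ab', 'b', 'c', 'c', 'cd', 'd'],
})
values_i_do_not_want = ['ab', 'cd']

# One mode per name, computed once: a Series indexed by name.
mode_by_name = df.groupby('name')['value'].agg(lambda x: x.mode().iloc[0])

# Keep acceptable values; elsewhere look up the mode for that row's name.
df['value'] = df['value'].where(
    ~df['value'].isin(values_i_do_not_want),
    df['name'].map(mode_by_name),
)

print(df)
```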