Most commonly used description use in each row for each unique id-CodePudding

I have a problem. I have articles and these articles have a unique id. The problem is, however, the article description aka article - this is not unique. I would like to try changing the name of the article description article so that there is only one description left. I always want to use the article name that occurs most often.

I have tried something, however I don't know how to access that with .apply(lambda x: ... and write the most used item out to me and then set that as the name.

How can I change the description of the article so that only the most frequently mentioned article description is included?

    id  article            cost
0   1   Bendge             15.30
1   1   Bendge             15.30
2   1   Bendge V2          15.30
3   1   Bendge - volumne 2 15.30
4   5   SEF                14.89
5   1   Bendge             15.30
6   2   DFH                 4.56
7   2   DFH                 4.56
8   2   DFH V2              4.56
9   2   DFH - volumne 2     4.56
10  2   DFH                 4.56

Code

d = {'id': [1, 1, 1, 1, 5, 1, 2, 2, 2, 2, 2],
     'article': ['Bendge', 'Bendge', 'Bendge V2', 'Bendge - volumne 2', 'SEF', 'Bendge',
                 'DFH', 'DFH', 'DFH V2', 'DFH - volumne 2', 'DFH'],
     'cost': [15.30, 15.30, 15.30, 15.30, 14.89, 15.30,
              4.56, 4.56, 4.56, 4.56, 4.56],
    }
df = pd.DataFrame(data=d)
display(df)

df.loc[df['id'] == 1, ['id','article']].value_counts().head(1)
[OUT]
id  article
1   Bendge     3
dtype: int64

df['article'] = df.apply(lambda x: x[['id','article']].value_counts().head(1))
[OUT]
KeyError: "None of [Index(['id', 'article'], dtype='object')] are in the [index]"

What I want

    id  article cost
0   1   Bendge  15.30
1   1   Bendge  15.30
2   1   Bendge  15.30
3   1   Bendge  15.30
4   5   SEF     14.89
5   1   Bendge  15.30
6   2   DFH      4.56
7   2   DFH      4.56
8   2   DFH      4.56
9   2   DFH      4.56
10  2   DFH      4.56

CodePudding user response：

Use GroupBy.transform with lambda function with Series.value_counts and first index:

df['article'] =  df.groupby('id')['article'].transform(lambda x: x.value_counts().index[0])
print (df)
    id article   cost
0    1  Bendge  15.30
1    1  Bendge  15.30
2    1  Bendge  15.30
3    1  Bendge  15.30
4    5     SEF  14.89
5    1  Bendge  15.30
6    2     DFH   4.56
7    2     DFH   4.56
8    2     DFH   4.56
9    2     DFH   4.56
10   2     DFH   4.56

Another solution with Series.mode with first value:

df['article'] = df.groupby('id')['article'].transform(lambda x: x.mode().iat[0])
print (df)
    id article   cost
0    1  Bendge  15.30
1    1  Bendge  15.30
2    1  Bendge  15.30
3    1  Bendge  15.30
4    5     SEF  14.89
5    1  Bendge  15.30
6    2     DFH   4.56
7    2     DFH   4.56
8    2     DFH   4.56
9    2     DFH   4.56
10   2     DFH   4.56

CodePudding user response：

You could use value_counts to get the counts; reset_index pivot idxmax to get the most common article for each id; then map it back to id:

mapping = df.value_counts(['id','article']).reset_index().pivot('id','article', 0).idxmax(axis=1)
df['article'] = df['id'].map(mapping)

Output:

    id article   cost
0    1  Bendge  15.30
1    1  Bendge  15.30
2    1  Bendge  15.30
3    1  Bendge  15.30
4    5     SEF  14.89
5    1  Bendge  15.30
6    2     DFH   4.56
7    2     DFH   4.56
8    2     DFH   4.56
9    2     DFH   4.56
10   2     DFH   4.56

CodePudding user response：

You can use mode:

df['article'] = df['id'].map(df.groupby('id')['article'].agg(pd.Series.mode))

or if a possibility of a tie:

df['article'] = df['id'].map(df.groupby('id')['article'].agg(lambda s: s.mode(s).iloc[0]))

output:

    id article   cost
0    1  Bendge  15.30
1    1  Bendge  15.30
2    1  Bendge  15.30
3    1  Bendge  15.30
4    5     SEF  14.89
5    1  Bendge  15.30
6    2     DFH   4.56
7    2     DFH   4.56
8    2     DFH   4.56
9    2     DFH   4.56
10   2     DFH   4.56

CodePudding user response：

Use:

#preparing data
string = """    c id  article            cost
0   1   Bendge             15.30
1   1   Bendge             15.30
2   1   BendgeV2          15.30
3   1   Bendge-volumne2 15.30
4   5   SEF                14.89
5   1   Bendge             15.30
6   2   DFH                 4.56
7   2   DFH                 4.56
8   2   DFHV2              4.56
9   2   DFH-volumne2     4.56
10  2   DFH                 4.56"""

data = [x.split() for x in string.split('\n')]
import pandas as pd
df = pd.DataFrame(data[1:], columns  = data[0])

#solution
df['article'] = df.groupby('id')['article'].transform(lambda x: x.value_counts().index[0])

Output:

c   id  article cost
0   0   1   Bendge  15.30
1   1   1   Bendge  15.30
2   2   1   Bendge  15.30
3   3   1   Bendge  15.30
4   4   5   SEF 14.89
5   5   1   Bendge  15.30
6   6   2   DFH 4.56
7   7   2   DFH 4.56
8   8   2   DFH 4.56
9   9   2   DFH 4.56
10  10  2   DFH 4.56

Second solution:

from collections import Counter
df.groupby('id')['article'].transform(lambda x: Counter(x).most_common()[0][0])