I have a problem. My articles have a unique id, but the article description (the article column) is not unique. I would like to change the article descriptions so that only one description remains per id, always using the name that occurs most often. I have tried something (see below), but I don't know how to do this with .apply(lambda x: ...) so that it determines the most frequently used name and sets it as the article name. How can I change the article descriptions so that only the most frequently occurring description remains for each id?
id article cost
0 1 Bendge 15.30
1 1 Bendge 15.30
2 1 Bendge V2 15.30
3 1 Bendge - volumne 2 15.30
4 5 SEF 14.89
5 1 Bendge 15.30
6 2 DFH 4.56
7 2 DFH 4.56
8 2 DFH V2 4.56
9 2 DFH - volumne 2 4.56
10 2 DFH 4.56
Code
import pandas as pd

d = {'id': [1, 1, 1, 1, 5, 1, 2, 2, 2, 2, 2],
     'article': ['Bendge', 'Bendge', 'Bendge V2', 'Bendge - volumne 2', 'SEF', 'Bendge',
                 'DFH', 'DFH', 'DFH V2', 'DFH - volumne 2', 'DFH'],
     'cost': [15.30, 15.30, 15.30, 15.30, 14.89, 15.30,
              4.56, 4.56, 4.56, 4.56, 4.56]}
df = pd.DataFrame(data=d)
display(df)
df.loc[df['id'] == 1, ['id','article']].value_counts().head(1)
[OUT]
id article
1 Bendge 3
dtype: int64
df['article'] = df.apply(lambda x: x[['id','article']].value_counts().head(1))
[OUT]
KeyError: "None of [Index(['id', 'article'], dtype='object')] are in the [index]"
What I want
id article cost
0 1 Bendge 15.30
1 1 Bendge 15.30
2 1 Bendge 15.30
3 1 Bendge 15.30
4 5 SEF 14.89
5 1 Bendge 15.30
6 2 DFH 4.56
7 2 DFH 4.56
8 2 DFH 4.56
9 2 DFH 4.56
10 2 DFH 4.56
CodePudding user response:
Use GroupBy.transform with a lambda function that calls Series.value_counts and takes the first index value:
df['article'] = df.groupby('id')['article'].transform(lambda x: x.value_counts().index[0])
print (df)
id article cost
0 1 Bendge 15.30
1 1 Bendge 15.30
2 1 Bendge 15.30
3 1 Bendge 15.30
4 5 SEF 14.89
5 1 Bendge 15.30
6 2 DFH 4.56
7 2 DFH 4.56
8 2 DFH 4.56
9 2 DFH 4.56
10 2 DFH 4.56
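As a quick aside on why .index[0] works: value_counts sorts its result by count in descending order, so the first index label is the most frequent value. A minimal standalone sketch (a toy Series mirroring the id == 1 group):
import pandas as pd

s = pd.Series(['Bendge', 'Bendge', 'Bendge V2', 'Bendge - volumne 2', 'Bendge'])
print(s.value_counts().index[0])  # 'Bendge' - it has the highest count (3)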
Another solution is Series.mode, taking the first value:
df['article'] = df.groupby('id')['article'].transform(lambda x: x.mode().iat[0])
print (df)
id article cost
0 1 Bendge 15.30
1 1 Bendge 15.30
2 1 Bendge 15.30
3 1 Bendge 15.30
4 5 SEF 14.89
5 1 Bendge 15.30
6 2 DFH 4.56
7 2 DFH 4.56
8 2 DFH 4.56
9 2 DFH 4.56
10 2 DFH 4.56
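As a design note: transform broadcasts the single per-group result back onto every row, while agg returns one value per id, which can then be mapped back via the id column - the pattern used in the answers below. A sketch, assuming the sample df from the question:
per_group = df.groupby('id')['article'].agg(lambda x: x.value_counts().index[0])
print(per_group)
# id
# 1    Bendge
# 2       DFH
# 5       SEF
# Name: article, dtype: object
df['article'] = df['id'].map(per_group)  # same end result as the transform call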
CodePudding user response:
You could use value_counts to get the counts; reset_index, pivot and idxmax to get the most common article for each id; then map it back to id:
mapping = (df.value_counts(['id', 'article'])
             .reset_index(name='n')
             .pivot(index='id', columns='article', values='n')
             .idxmax(axis=1))
df['article'] = df['id'].map(mapping)
Output:
id article cost
0 1 Bendge 15.30
1 1 Bendge 15.30
2 1 Bendge 15.30
3 1 Bendge 15.30
4 5 SEF 14.89
5 1 Bendge 15.30
6 2 DFH 4.56
7 2 DFH 4.56
8 2 DFH 4.56
9 2 DFH 4.56
10 2 DFH 4.56
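The same idea can be sketched without the pivot, assuming the counts keep their (id, article) MultiIndex: idxmax per id returns the (id, article) label with the highest count, and .str[1] extracts the article part:
counts = df.value_counts(['id', 'article'])            # MultiIndex (id, article) -> count
mapping = counts.groupby(level='id').idxmax().str[1]   # most frequent article per id
df['article'] = df['id'].map(mapping)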
CodePudding user response:
You can use mode:
df['article'] = df['id'].map(df.groupby('id')['article'].agg(pd.Series.mode))
or, if there is a possibility of a tie:
df['article'] = df['id'].map(df.groupby('id')['article'].agg(lambda s: s.mode().iloc[0]))
Output:
id article cost
0 1 Bendge 15.30
1 1 Bendge 15.30
2 1 Bendge 15.30
3 1 Bendge 15.30
4 5 SEF 14.89
5 1 Bendge 15.30
6 2 DFH 4.56
7 2 DFH 4.56
8 2 DFH 4.56
9 2 DFH 4.56
10 2 DFH 4.56
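For reference on the tie case: Series.mode returns all tied values in sorted order, so .iloc[0] falls back to the lexicographically first one. A tiny sketch with made-up tied data:
import pandas as pd

s = pd.Series(['DFH', 'DFH V2', 'DFH V2', 'DFH'])
print(s.mode())          # both values occur twice, returned sorted
# 0       DFH
# 1    DFH V2
# dtype: object
print(s.mode().iloc[0])  # 'DFH'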
CodePudding user response:
Use:
#preparing data
string = """ c id article cost
0 1 Bendge 15.30
1 1 Bendge 15.30
2 1 BendgeV2 15.30
3 1 Bendge-volumne2 15.30
4 5 SEF 14.89
5 1 Bendge 15.30
6 2 DFH 4.56
7 2 DFH 4.56
8 2 DFHV2 4.56
9 2 DFH-volumne2 4.56
10 2 DFH 4.56"""
data = [x.split() for x in string.split('\n')]
import pandas as pd
df = pd.DataFrame(data[1:], columns = data[0])
#solution
df['article'] = df.groupby('id')['article'].transform(lambda x: x.value_counts().index[0])
Output:
c id article cost
0 0 1 Bendge 15.30
1 1 1 Bendge 15.30
2 2 1 Bendge 15.30
3 3 1 Bendge 15.30
4 4 5 SEF 14.89
5 5 1 Bendge 15.30
6 6 2 DFH 4.56
7 7 2 DFH 4.56
8 8 2 DFH 4.56
9 9 2 DFH 4.56
10 10 2 DFH 4.56
Second solution:
from collections import Counter
df.groupby('id')['article'].transform(lambda x: Counter(x).most_common()[0][0])
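For clarity, Counter(x).most_common() returns (value, count) pairs sorted by count in descending order, so [0][0] is the most frequent article of each group. A small sketch with one group's values (article names without spaces, as in the parsed data above):
from collections import Counter

group = ['Bendge', 'Bendge', 'BendgeV2', 'Bendge-volumne2', 'Bendge']
print(Counter(group).most_common(1))  # [('Bendge', 3)]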