I have the following dataframe (df). Col A is a category and Col B is an item in that category.
Col A | Col B |
---|---|
red | car |
red | car |
red | truck |
red | ball |
blue | bus |
blue | bus |
blue | bus |
blue | truck |
blue | car |
I want to get another dataframe (df2) showing the total count of distinct categories in Col A, followed by the count highest occurring item in Col B corresponding to Col A as below:
Col A | Count A | Col B | Count B |
---|---|---|---|
red | 4 | car | 2 |
blue | 5 | bus | 2 |
Any idea on how to generate this dataframe?
I have tried this command:
df2 = df.groupby('Col A')['Col B'].apply(lambda x: x.value_counts().index[0]).reset_index()
and I get the following result:
Col A | Col B |
---|---|
red | car |
blue | bus |
I don't know how to get the two counts column. Any ideas?
CodePudding user response:
You can use Counter
from Collections:
from collections import Counter
final = df.groupby(['Col A']).agg({'Col A':'count','Col B':list})
'''
Col A Col B
Col A
blue 5 [bus, bus, bus, truck, car]
red 4 [car, car, truck, ball]
'''
final['Col_b'] = final['Col B'].apply(lambda x: Counter(x).most_common(1)[0][0]) #get most common value
final['Count_b'] = final['Col B'].apply(lambda x: Counter(x).most_common(1)[0][1]) #get count of most common value
final=final.drop('Col B',axis=1).rename(columns={'Col A':'Count A'}).reset_index()
Output:
| | Col A | Count A | Col_b | Count_b |
|---:|:--------|----------:|:--------|----------:|
| 0 | blue | 5 | bus | 3 |
| 1 | red | 4 | car | 2 |