I have some initial data that looks like this:
code type value
1111 Golf Acceptable
1111 Golf Undesirable
1111 Basketball Acceptable
1111 Basketball Undesirable
1111 Basketball Undesirable
and I'm trying to group it on the code and type columns to get the row with the most occurrences. In the case of a tie, I want to select the row with the value Undesirable. So the example above would become this:
code type value
1111 Golf Undesirable
1111 Basketball Undesirable
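For reference, the sample data as a DataFrame (a minimal reproduction of the table above, assuming pandas is imported as pd):
import pandas as pd

# Reproduction of the sample data shown above
df = pd.DataFrame({
    'code': [1111, 1111, 1111, 1111, 1111],
    'type': ['Golf', 'Golf', 'Basketball', 'Basketball', 'Basketball'],
    'value': ['Acceptable', 'Undesirable', 'Acceptable', 'Undesirable', 'Undesirable'],
})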
Currently I'm doing it this way:
df = pd.DataFrame(df.groupby(['code', 'type', 'value']).size().reset_index(name='count'))
df = df.sort_values(['type', 'count'])
df = pd.DataFrame(df.groupby(['code', 'type']).last().reset_index())
I've done some testing of this and it seems to do what I want, but I don't really like trusting the .last() call and hoping that, in the case of a tie, Undesirable was sorted last. Is there a better way to group this to ensure I always get the row with the higher count, or in the case of a tie select the Undesirable value?
Performance isn't too much of an issue as I'm only working with around 50k rows or so.
CodePudding user response:
Case 1
If the value column only contains the two values ['Acceptable', 'Undesirable'], then we can rely on the fact that Acceptable < Undesirable alphabetically. In this case you can use the following simplified solution.
Create an auxiliary column called count which contains the number of rows per code, type and value. Then sort the dataframe by count and value, and drop the dupes per code and type, keeping the last row.
c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
Case 2
If the value column contains other values and you can't rely on alphabetical ordering, use the following solution, which is similar to the one proposed in case 1 but first converts the value column to an ordered Categorical type before sorting:
c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df['value'] = pd.Categorical(df['value'], categories=['Acceptable', 'Undesirable'], ordered=True)
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
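To see what the ordered Categorical buys you, here is a tiny illustration with made-up labels where the preferred value does not sort last alphabetically (the labels are hypothetical; only the ordering mechanism matters):
import pandas as pd

# Hypothetical labels: 'Bad' should win ties but sorts before 'Good' alphabetically
s = pd.Series(['Good', 'Bad', 'Good'])
cat = pd.Series(pd.Categorical(s, categories=['Good', 'Bad'], ordered=True))
print(cat.sort_values().tolist())  # ['Good', 'Good', 'Bad'] -> 'Bad' now sorts last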
Result
code type value count
1 1111 Golf Undesirable 1
4 1111 Basketball Undesirable 2
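If the helper count column shouldn't appear in the final output, it can be dropped at the end. A small follow-up sketch, reusing df and c from either case above:
# Optional clean-up: drop the auxiliary column and reset the index
out = (df.sort_values(['count', 'value'])
         .drop_duplicates(c, keep='last')
         .drop(columns='count')
         .reset_index(drop=True))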
CodePudding user response:
Another possible solution, which is based on the following ideas:
- Group the data by code and type.
- If a group has more than one row (len(x) > 1) and all its rows have the same count ((x['count'] == x['count'].min()).all()), return the row with Undesirable.
- Otherwise, return the row where the count is maximum (x.iloc[[x['count'].argmax()]]).
(df.groupby(['code', 'type', 'value'])['value'].size()
.reset_index(name='count').groupby(['code', 'type'])
.apply(lambda x: x.loc[x['value'] == 'Undesirable'] if
((len(x) > 1) and (x['count'] == x['count'].min()).all()) else
x.iloc[[x['count'].argmax()]])
.reset_index(drop=True)
.drop('count', axis=1))
Output:
code type value
0 1111 Basketball Undesirable
1 1111 Golf Undesirable
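Equivalently, if the lambda gets hard to read, the same logic can be written with a named helper function. This is a sketch of the same approach, not a different algorithm; pick_row is just an illustrative name, and df is the original DataFrame from the question:
def pick_row(x):
    # Tie: every (code, type, value) combination in the group has the same count -> prefer 'Undesirable'
    if len(x) > 1 and (x['count'] == x['count'].min()).all():
        return x.loc[x['value'] == 'Undesirable']
    # Otherwise keep the row with the highest count
    return x.iloc[[x['count'].argmax()]]

counts = df.groupby(['code', 'type', 'value'])['value'].size().reset_index(name='count')
result = (counts.groupby(['code', 'type'])
                .apply(pick_row)
                .reset_index(drop=True)
                .drop('count', axis=1))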