Pandas Group By and Sorting by multiple columns


I have some initial data that looks like this:

code          type          value
1111          Golf     Acceptable
1111          Golf    Undesirable
1111    Basketball     Acceptable
1111    Basketball    Undesirable
1111    Basketball    Undesirable

and I'm trying to group it on the code and type columns to get the row with the most occurrences. In the case of a tie, I want to select the row with the value Undesirable. So the example above would become this:

code          type          value
1111          Golf    Undesirable
1111    Basketball    Undesirable

Currently I'm doing it this way:

# Count occurrences of each (code, type, value) combination
df = df.groupby(['code', 'type', 'value']).size().reset_index(name='count')

# Sort so the most frequent row ends up last within each (code, type) group
df = df.sort_values(['type', 'count'])

# Keep the last row per (code, type)
df = df.groupby(['code', 'type']).last().reset_index()

I've done some testing of this and it seems to do what I want, but I don't really like trusting the .last() call and hoping that, in the case of a tie, Undesirable happened to be sorted last. Is there a better way to group this that ensures I always get the row with the higher count, or, in the case of a tie, selects the Undesirable value?

Performance isn't too much of an issue as I'm only working with around 50k rows or so.
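
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample data above (the frame is assumed to be named df, as in the code above):

import pandas as pd

# Rebuild the sample data shown at the top of the question
df = pd.DataFrame({
    'code':  [1111, 1111, 1111, 1111, 1111],
    'type':  ['Golf', 'Golf', 'Basketball', 'Basketball', 'Basketball'],
    'value': ['Acceptable', 'Undesirable', 'Acceptable', 'Undesirable', 'Undesirable'],
})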

CodePudding user response:

Case 1

If the value column only contains the two values ['Acceptable', 'Undesirable'], then we can rely on the fact that 'Acceptable' < 'Undesirable' alphabetically. In this case you can use the following simplified solution.

Create an auxiliary column called count which contains the number of rows per code, type and value. Then sort the dataframe by count and value and drop the duplicates per code and type, keeping the last row.

c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
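
If the auxiliary count column shouldn't end up in the final output, it can be dropped again afterwards. A minimal sketch, reusing c and df from above (result is just an arbitrary name):

# Same as above, but without the helper column and with a clean index
result = (df.sort_values(['count', 'value'])
            .drop_duplicates(c, keep='last')
            .drop(columns='count')
            .reset_index(drop=True))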

Case 2

If the value column contains other values and you can't rely on alphabetical ordering, use the following solution. It is similar to the one proposed in Case 1, but it first converts the value column to an ordered Categorical type before sorting.

c = ['code', 'type']

df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df['value'] = pd.Categorical(df['value'], categories=['Acceptable', 'Undesirable'], ordered=True)

df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
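
To illustrate why the explicit category order matters, here is a small sketch with a hypothetical extra value 'Unknown'. Alphabetically it would sort after 'Undesirable' (and would therefore win a tie with keep='last'), but the ordered Categorical keeps 'Undesirable' ranked highest:

import pandas as pd

s = pd.Series(['Unknown', 'Undesirable', 'Acceptable'])
s = pd.Series(pd.Categorical(s, categories=['Acceptable', 'Unknown', 'Undesirable'], ordered=True))
print(s.sort_values().tolist())
# ['Acceptable', 'Unknown', 'Undesirable'] -- 'Undesirable' still sorts last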

Result

   code        type        value  count
1  1111        Golf  Undesirable      1
4  1111  Basketball  Undesirable      2

CodePudding user response:

Another possible solution, which is based on the following ideas:

  1. Grouping the data by code and type.

  2. If a group has more than one row (len(x) > 1) and all of its rows have the same count ((x['count'] == x['count'].min()).all()), return the row with Undesirable.

  3. Otherwise, return the row where the count is maximum (x.iloc[[x['count'].argmax()]]).

(df.groupby(['code', 'type', 'value'])['value'].size()
 .reset_index(name='count').groupby(['code', 'type'])
 .apply(lambda x: x.loc[x['value'] == 'Undesirable'] if 
        ((len(x) > 1) and (x['count'] == x['count'].min()).all()) else
        x.iloc[[x['count'].argmax()]])
 .reset_index(drop=True)
 .drop('count', axis=1))

Output:

   code        type        value
0  1111  Basketball  Undesirable
1  1111        Golf  Undesirable
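
Side note: because groupby sorts the group keys by default, Basketball comes before Golf in this output, unlike in the expected output in the question. If first-appearance order is preferred, one hypothetical variant is to pass sort=False to both groupby calls; everything else stays the same:

(df.groupby(['code', 'type', 'value'], sort=False)['value'].size()
 .reset_index(name='count').groupby(['code', 'type'], sort=False)
 .apply(lambda x: x.loc[x['value'] == 'Undesirable'] if
        ((len(x) > 1) and (x['count'] == x['count'].min()).all()) else
        x.iloc[[x['count'].argmax()]])
 .reset_index(drop=True)
 .drop('count', axis=1))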