finding lowest value in a column for each value in another column

I'm using pandas library for dataframes. In the below data, for each team, there are points for each month (1 - 6) for each year( 2020, 2019, 2018).

month  team  points2020  points2019  points2018
1      team1   50           10         5
2      team1   20           40         2
3      team1   12           14        17
4      team1   8            9          3
5      team1   2            3          1 
6      team1   30           18         60
1      team2   8            9          10
2      team2   40           70         30
3      team2   25           19         34
4      team2   88           70          1
5      team2   23           45          5
6      team2   55           77          90

What I'm trying to display is for each month, only show the team with the lowest points for each year

So for example, from the above data, for month '1' for 'points2020', I would want to only return team2 in 'team' column, since team2 has the lowest points for 'points2020'.

For month '1' for points2019, I would only want to return team2, in the team column since team2 has the lowest points for 'points2019', and so on.

How would I go about achieving this?

Example of desired output:

month  year  team   points
1      2020  team2   8
2      2020  team1   20
3      2020  team1   12
4      2020  team1   8
5      2020  team1   2
6      2020  team1   30
1      2019  team2   9
2      2019  team1   40
3      2019  team1   14
4      2019  team1   9
5      2019  team1   3
6      2019  team1   18

CodePudding user response：

Define team column as index and group by month then use idxmin to extract the team (index) with the lowest score:

out = df.set_index('team').groupby('month', as_index=False).idxmin()
print(out)

# Output
   month points2020 points2019 points2018
0      1      team2      team2      team1
1      2      team1      team1      team1
2      3      team1      team1      team1
3      4      team1      team1      team2
4      5      team1      team1      team1
5      6      team1      team1      team1

CodePudding user response：

Using df.melt to transform your columns into rows and then finding the rows with the minimum value after a groupby worked for me:

First, transform the point columns into rows (creates "year" and "points" columns)

>> df = df.melt(id_vars=["month", "team"], var_name="year", value_name="points")
>> print(df.head())
   month   team        year  points
0      1  team1  points2020      50
1      2  team1  points2020      20
2      3  team1  points2020      12
3      4  team1  points2020       8
4      5  team1  points2020       2

For each month and year, find the row with the minimum points

>> df = df.loc[df.groupby(["month", "year"]).points.idxmin()]

Sort values in a way that matches your expected output

>> print(df.sort_values(["year", "month"]))
    month   team        year  points
24      1  team1  points2018       5
25      2  team1  points2018       2
26      3  team1  points2018      17
33      4  team2  points2018       1
28      5  team1  points2018       1
29      6  team1  points2018      60
18      1  team2  points2019       9
13      2  team1  points2019      40
14      3  team1  points2019      14
15      4  team1  points2019       9
16      5  team1  points2019       3
17      6  team1  points2019      18
6       1  team2  points2020       8
1       2  team1  points2020      20
2       3  team1  points2020      12
3       4  team1  points2020       8
4       5  team1  points2020       2
5       6  team1  points2020      30

CodePudding user response：

Try this:

s = df.set_index(['month','team']).stack().rename_axis(['month','team','year'])

(s.loc[s.groupby(level=[0,2]).idxmin()]
 .sort_index(level=[2,0],ascending=[0,1])
 .reset_index(name='points')
 .assign(year = lambda x: x['year'].str.extract('(\d )',expand=False)))

Output:

    month   team  year  points
0       1  team2  2020       8
1       2  team1  2020      20
2       3  team1  2020      12
3       4  team1  2020       8
4       5  team1  2020       2
5       6  team1  2020      30
6       1  team2  2019       9
7       2  team1  2019      40
8       3  team1  2019      14
9       4  team1  2019       9
10      5  team1  2019       3
11      6  team1  2019      18

CodePudding user response：

You need a transformation from wide to long, before computing a groupby aggregation:

(
pd.wide_to_long(df, stubnames="points", i=["month", "team"], j="year")
.reset_index()
.groupby(["month", "year"], as_index=False, sort=False)
.agg(points=("points", "min"))
)

    month  year  points
0       1  2020       8
1       1  2019       9
2       1  2018       5
3       2  2020      20
4       2  2019      40
5       2  2018       2
6       3  2020      12
7       3  2019      14
8       3  2018      17
9       4  2020       8
10      4  2019       9
11      4  2018       1
12      5  2020       2
13      5  2019       3
14      5  2018       1
15      6  2020      30
16      6  2019      18
17      6  2018      60

Another option is to do the groupby first, before converting to long form(less number of rows when transforming into long form):

temp = df.groupby("month").min()
temp = temp.set_index('team', append = True)
temp.columns = temp.columns.str.split("(\d )", expand = True).droplevel(-1)
temp.columns.names = [None, 'year']
temp.stack().reset_index()

    month   team  year  points
0       1  team1  2018       5
1       1  team1  2019       9
2       1  team1  2020       8
3       2  team1  2018       2
4       2  team1  2019      40
5       2  team1  2020      20
6       3  team1  2018      17
7       3  team1  2019      14
8       3  team1  2020      12
9       4  team1  2018       1
10      4  team1  2019       9
11      4  team1  2020       8
12      5  team1  2018       1
13      5  team1  2019       3
14      5  team1  2020       2
15      6  team1  2018      60
16      6  team1  2019      18
17      6  team1  2020      30

The steps above can be abstracted with pivot_longer from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

(df
.groupby("month", as_index=False)
.min()
.pivot_longer(index = ["month", "team"],
              names_to = (".value", "year"),
              names_pattern = r"(\D )(\d )")
)

    month   team  year  points
0       1  team1  2020       8
1       2  team1  2020      20
2       3  team1  2020      12
3       4  team1  2020       8
4       5  team1  2020       2
5       6  team1  2020      30
6       1  team1  2019       9
7       2  team1  2019      40
8       3  team1  2019      14
9       4  team1  2019       9
10      5  team1  2019       3
11      6  team1  2019      18
12      1  team1  2018       5
13      2  team1  2018       2
14      3  team1  2018      17
15      4  team1  2018       1
16      5  team1  2018       1
17      6  team1  2018      60