I'm using pandas library for dataframes. In the below data, for each team, there are points for each month (1 - 6) for each year( 2020, 2019, 2018).
month team points2020 points2019 points2018
1 team1 50 10 5
2 team1 20 40 2
3 team1 12 14 17
4 team1 8 9 3
5 team1 2 3 1
6 team1 30 18 60
1 team2 8 9 10
2 team2 40 70 30
3 team2 25 19 34
4 team2 88 70 1
5 team2 23 45 5
6 team2 55 77 90
What I'm trying to display is for each month, only show the team with the lowest points for each year
So for example, from the above data, for month '1' for 'points2020', I would want to only return team2 in 'team' column, since team2 has the lowest points for 'points2020'.
For month '1' for points2019, I would only want to return team2, in the team column since team2 has the lowest points for 'points2019', and so on.
How would I go about achieving this?
Example of desired output:
month year team points
1 2020 team2 8
2 2020 team1 20
3 2020 team1 12
4 2020 team1 8
5 2020 team1 2
6 2020 team1 30
1 2019 team2 9
2 2019 team1 40
3 2019 team1 14
4 2019 team1 9
5 2019 team1 3
6 2019 team1 18
CodePudding user response:
Define team
column as index and group by month
then use idxmin
to extract the team (index) with the lowest score:
out = df.set_index('team').groupby('month', as_index=False).idxmin()
print(out)
# Output
month points2020 points2019 points2018
0 1 team2 team2 team1
1 2 team1 team1 team1
2 3 team1 team1 team1
3 4 team1 team1 team2
4 5 team1 team1 team1
5 6 team1 team1 team1
CodePudding user response:
Using df.melt
to transform your columns into rows and then finding the rows with the minimum value after a groupby worked for me:
First, transform the point columns into rows (creates "year" and "points" columns)
>> df = df.melt(id_vars=["month", "team"], var_name="year", value_name="points") >> print(df.head()) month team year points 0 1 team1 points2020 50 1 2 team1 points2020 20 2 3 team1 points2020 12 3 4 team1 points2020 8 4 5 team1 points2020 2
For each month and year, find the row with the minimum points
>> df = df.loc[df.groupby(["month", "year"]).points.idxmin()]
Sort values in a way that matches your expected output
>> print(df.sort_values(["year", "month"])) month team year points 24 1 team1 points2018 5 25 2 team1 points2018 2 26 3 team1 points2018 17 33 4 team2 points2018 1 28 5 team1 points2018 1 29 6 team1 points2018 60 18 1 team2 points2019 9 13 2 team1 points2019 40 14 3 team1 points2019 14 15 4 team1 points2019 9 16 5 team1 points2019 3 17 6 team1 points2019 18 6 1 team2 points2020 8 1 2 team1 points2020 20 2 3 team1 points2020 12 3 4 team1 points2020 8 4 5 team1 points2020 2 5 6 team1 points2020 30
CodePudding user response:
Try this:
s = df.set_index(['month','team']).stack().rename_axis(['month','team','year'])
(s.loc[s.groupby(level=[0,2]).idxmin()]
.sort_index(level=[2,0],ascending=[0,1])
.reset_index(name='points')
.assign(year = lambda x: x['year'].str.extract('(\d )',expand=False)))
Output:
month team year points
0 1 team2 2020 8
1 2 team1 2020 20
2 3 team1 2020 12
3 4 team1 2020 8
4 5 team1 2020 2
5 6 team1 2020 30
6 1 team2 2019 9
7 2 team1 2019 40
8 3 team1 2019 14
9 4 team1 2019 9
10 5 team1 2019 3
11 6 team1 2019 18
CodePudding user response:
You need a transformation from wide to long, before computing a groupby aggregation:
(
pd.wide_to_long(df, stubnames="points", i=["month", "team"], j="year")
.reset_index()
.groupby(["month", "year"], as_index=False, sort=False)
.agg(points=("points", "min"))
)
month year points
0 1 2020 8
1 1 2019 9
2 1 2018 5
3 2 2020 20
4 2 2019 40
5 2 2018 2
6 3 2020 12
7 3 2019 14
8 3 2018 17
9 4 2020 8
10 4 2019 9
11 4 2018 1
12 5 2020 2
13 5 2019 3
14 5 2018 1
15 6 2020 30
16 6 2019 18
17 6 2018 60
Another option is to do the groupby first, before converting to long form(less number of rows when transforming into long form):
temp = df.groupby("month").min()
temp = temp.set_index('team', append = True)
temp.columns = temp.columns.str.split("(\d )", expand = True).droplevel(-1)
temp.columns.names = [None, 'year']
temp.stack().reset_index()
month team year points
0 1 team1 2018 5
1 1 team1 2019 9
2 1 team1 2020 8
3 2 team1 2018 2
4 2 team1 2019 40
5 2 team1 2020 20
6 3 team1 2018 17
7 3 team1 2019 14
8 3 team1 2020 12
9 4 team1 2018 1
10 4 team1 2019 9
11 4 team1 2020 8
12 5 team1 2018 1
13 5 team1 2019 3
14 5 team1 2020 2
15 6 team1 2018 60
16 6 team1 2019 18
17 6 team1 2020 30
The steps above can be abstracted with pivot_longer
from pyjanitor
:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.groupby("month", as_index=False)
.min()
.pivot_longer(index = ["month", "team"],
names_to = (".value", "year"),
names_pattern = r"(\D )(\d )")
)
month team year points
0 1 team1 2020 8
1 2 team1 2020 20
2 3 team1 2020 12
3 4 team1 2020 8
4 5 team1 2020 2
5 6 team1 2020 30
6 1 team1 2019 9
7 2 team1 2019 40
8 3 team1 2019 14
9 4 team1 2019 9
10 5 team1 2019 3
11 6 team1 2019 18
12 1 team1 2018 5
13 2 team1 2018 2
14 3 team1 2018 17
15 4 team1 2018 1
16 5 team1 2018 1
17 6 team1 2018 60