df=pd.DataFrame({"date":['2021-06-02', '2021-06-07', '2021-06-15', '2021-06-19', '2021-06-23', '2021-09-05', '2021-10-08', '2021-11-11', '2022-03-26', '1950-05-28', '1953-10-11', '1954-06-01', '1955-05-28', '1955-10-30', '1956-12-16', '1957-10-27', '1958-02-20', '1959-11-21', '1960-02-07', '1960-07-03'],
"home_team":['Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana'],
"away_team":['Denmark', 'Latvia', 'France', 'Portugal', 'Hungary', 'Armenia', 'Romania', 'Liechtenstein', 'Israel', 'Nigeria', 'Nigeria', 'Sierra Leone', 'Nigeria', 'Nigeria', 'Sierra Leone', 'Nigeria', 'Togo', 'Nigeria', 'Sierra Leone', 'Egypt'],
"home_score":[1, 7, 0, 4, 2, 6, 2, 9, 2, 1, 1, 2, 1, 7, 4, 3, 5, 5, 6, 2],
"away_score":[1, 1, 1, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 2, 2, 2]
})
date home_team away_team home_score away_score
0 2021-06-02 Germany Denmark 1 1
1 2021-06-07 Germany Latvia 7 1
2 2021-06-15 Germany France 0 1
3 2021-06-19 Germany Portugal 4 2
4 2021-06-23 Germany Hungary 2 2
5 2021-09-05 Germany Armenia 6 0
6 2021-10-08 Germany Romania 2 1
7 2021-11-11 Germany Liechtenstein 9 0
8 2022-03-26 Germany Israel 2 0
9 1950-05-28 Ghana Nigeria 1 0
10 1953-10-11 Ghana Nigeria 1 0
11 1954-06-01 Ghana Sierra Leone 2 0
12 1955-05-28 Ghana Nigeria 1 0
13 1955-10-30 Ghana Nigeria 7 0
14 1956-12-16 Ghana Sierra Leone 4 3
15 1957-10-27 Ghana Nigeria 3 3
16 1958-02-20 Ghana Togo 5 2
17 1959-11-21 Ghana Nigeria 5 2
18 1960-02-07 Ghana Sierra Leone 6 2
19 1960-07-03 Ghana Egypt 2 2
In this pandas dataframe I want to calculate the longest consecutive home win length for every team. I want my result to be like this:
Team HomeWinChain
Germany 4
Ghana 6
I've tried different things but couldn't get the desired result.
CodePudding user response:
You can use a double groupby
, first to identify the stretches of wins, then to get the max one:
# is the row a win?
win = df['home_score'].gt(df['away_score'])
# subselect wins, groupby consecutive wins, get size, get max size
(win[win]
.groupby([df['home_team'], (~win).cumsum()])
.size()
.groupby(level='home_team').max()
)
output:
home_team
Germany 4
Ghana 6
dtype: int64
intermediate with all the stretches:
(win[win]
.groupby([df['home_team'], (~win).cumsum()])
.size()
#.groupby(level='home_team').max()
)
output:
home_team
Germany 1 1
2 1
3 4
Ghana 3 6
4 3
dtype: int64
CodePudding user response:
import pandas as pd
df=pd.DataFrame({"date":['2021-06-02', '2021-06-07', '2021-06-15', '2021-06-19', '2021-06-23', '2021-09-05', '2021-10-08', '2021-11-11', '2022-03-26', '1950-05-28', '1953-10-11', '1954-06-01', '1955-05-28', '1955-10-30', '1956-12-16', '1957-10-27', '1958-02-20', '1959-11-21', '1960-02-07', '1960-07-03'],
"home_team":['Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana', 'Ghana'],
"away_team":['Denmark', 'Latvia', 'France', 'Portugal', 'Hungary', 'Armenia', 'Romania', 'Liechtenstein', 'Israel', 'Nigeria', 'Nigeria', 'Sierra Leone', 'Nigeria', 'Nigeria', 'Sierra Leone', 'Nigeria', 'Togo', 'Nigeria', 'Sierra Leone', 'Egypt'],
"home_score":[1, 7, 0, 4, 2, 6, 2, 9, 2, 1, 1, 2, 1, 7, 4, 3, 5, 5, 6, 2],
"away_score":[1, 1, 1, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 2, 2, 2]
})
df['HomeWinChain'] = 0
cnt = 1
for i in range(len(df[df['home_team'] == 'Germany'])):
if df['home_score'][i] > df['away_score'][i]:
df['HomeWinChain'][i] = cnt
cnt = 1
else:
cnt = 1
cnt = 1
for i in range(len(df[df['home_team'] == 'Germany']), len(df)):
if df['home_score'][i] > df['away_score'][i]:
df['HomeWinChain'][i] = cnt
cnt = 1
else:
cnt = 1
df[['home_team','HomeWinChain']].groupby('home_team').aggregate({'HomeWinChain':'max'})
You can do well. You can think about how to reduce the code.