I was hoping somebody could help me with increasing the efficiency of my for loops. I read that numpy vectorisation can make code "up to 74,000 times faster", but I didn't find much documentation on it.
I was wondering how others would increase the speed of this code.
from mplsoccer import Sbapi,Sbopen
import pandas as pd
from tqdm import tqdm
import numpy as np
import io
parser = Sbopen(dataframe=True)
matches = parser.match(43,3)
matchids = list(matches.match_id.unique())
data = []
# The following for-loop concatenates approximately 429 DataFrames of up to 30,000 rows each
for i in tqdm(matchids):
    event = parser.event(i)[0]
    data.append(event)
data = pd.concat(data)
data = data.merge(matches[['match_id','home_team_name','away_team_name']], on='match_id')
data = data[(data['type_name']=='Pass') & (data['outcome_name'].isna()) | (data['type_name']=='Carry') & (data['outcome_name'].isna())]
data = data[['match_id','type_name','possession_team_name','home_team_name','away_team_name','x','y','end_x','end_y']]
xT = """
0.00638303 0.00779616 0.00844854 0.00977659 0.01126267 0.01248344 0.01473596 0.0174506 0.02122129 0.02756312 0.03485072 0.0379259
0.00750072 0.00878589 0.00942382 0.0105949 0.01214719 0.0138454 0.01611813 0.01870347 0.02401521 0.02953272 0.04066992 0.04647721
0.0088799 0.00977745 0.01001304 0.01110462 0.01269174 0.01429128 0.01685596 0.01935132 0.0241224 0.02855202 0.05491138 0.06442595
0.00941056 0.01082722 0.01016549 0.01132376 0.01262646 0.01484598 0.01689528 0.0199707 0.02385149 0.03511326 0.10805102 0.25745362
0.00941056 0.01082722 0.01016549 0.01132376 0.01262646 0.01484598 0.01689528 0.0199707 0.02385149 0.03511326 0.10805102 0.25745362
0.0088799 0.00977745 0.01001304 0.01110462 0.01269174 0.01429128 0.01685596 0.01935132 0.0241224 0.02855202 0.05491138 0.06442595
0.00750072 0.00878589 0.00942382 0.0105949 0.01214719 0.0138454 0.01611813 0.01870347 0.02401521 0.02953272 0.04066992 0.04647721
0.00638303 0.00779616 0.00844854 0.00977659 0.01126267 0.01248344 0.01473596 0.0174506 0.02122129 0.02756312 0.03485072 0.0379259
"""
xT = pd.read_csv(io.StringIO(xT), sep=r"\s+", header=None)
xT = np.array(xT)
xT_rows, xT_cols = xT.shape
data['x1_bin'] = pd.cut(data['x'], bins=xT_cols, labels=False)
data['y1_bin'] = pd.cut(data['y'], bins=xT_rows, labels=False)
data['x2_bin'] = pd.cut(data['end_x'], bins=xT_cols, labels=False)
data['y2_bin'] = pd.cut(data['end_y'], bins=xT_rows, labels=False)
data['start_zone_value'] = data[['x1_bin', 'y1_bin']].apply(lambda x: xT[x[1]][x[0]], axis=1)
data['end_zone_value'] = data[['x2_bin', 'y2_bin']].apply(lambda x: xT[x[1]][x[0]], axis=1)
data['xT'] = data['end_zone_value'] - data['start_zone_value']
hometeamxt = []
awayteamxt = []
for match_id in tqdm(data['match_id']):
    match = data[data['match_id'] == match_id]
    home_xt = match[match['possession_team_name'] == match['home_team_name']]['xT'].sum()
    away_xt = match[match['possession_team_name'] == match['away_team_name']]['xT'].sum()
    hometeamxt.append(home_xt)
    awayteamxt.append(away_xt)
data['homext']=hometeamxt
data['awayxt']=awayteamxt
xtData = data.drop_duplicates(subset = 'match_id', keep = 'first').drop(['xT', 'possession_team_name'], axis = 1)
The slowest for-loop tends to be the first one:
for i in tqdm(matchids):
    event = parser.event(i)[0]
    data.append(event)
data = pd.concat(data)
CodePudding user response:
The block you cite isn't the slowest part of your code. All the time consumed by that block is spent inside the parser.event function, which does two things:
- Get the data from GitHub
- Process that data into a DataFrame
The first operation can be made faster by using some asynchronous IO (one way is sketched below), but the second operation is CPU bound and thus harder to parallelize. So unless you want to modify the StatsBomb parser itself, there is little you can do here.
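If the download really is the dominant cost, one option is to overlap the network requests with a thread pool. This is only a sketch, not a documented mplsoccer pattern: it assumes parser.event is safe to call from several threads at once, and max_workers=8 is an arbitrary choice.

from concurrent.futures import ThreadPoolExecutor

def fetch_event(match_id):
    # parser.event returns several DataFrames; index 0 is the event frame
    return parser.event(match_id)[0]

# Fetch and parse the matches concurrently; pool.map preserves the order
# of matchids, so the concatenated result matches the serial loop.
with ThreadPoolExecutor(max_workers=8) as pool:
    frames = list(tqdm(pool.map(fetch_event, matchids), total=len(matchids)))

data = pd.concat(frames)

The parsing itself still runs in Python, so the speed-up is limited to however much of the runtime is spent waiting on the network.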
The slowest part is the second tqdm loop:
for match_id in tqdm(data['match_id']):
    match = data[data['match_id'] == match_id]
    home_xt = match[match['possession_team_name'] == match['home_team_name']]['xT'].sum()
    away_xt = match[match['possession_team_name'] == match['away_team_name']]['xT'].sum()
    hometeamxt.append(home_xt)
    awayteamxt.append(away_xt)
Here, a bad algorithm makes the for loop even worse. You only need a summary for 64 matches, split into home and away sides, yet you iterate over roughly 100k rows, recomputing the same per-match summary again and again, only to drop the duplicates at the end.
It's obviously a lot better to calculate the summary only once:
# Indicate whether the stats on each row belong to the home or the away team
is_home = data["possession_team_name"] == data["home_team_name"]

# The total xT for each match, split by home (True) and away (False)
xT = data.groupby([is_home, "match_id"])["xT"].sum()

# Combine the pieces together
xtData = pd.concat([
    # The big stats DataFrame. We no longer need all the detailed stats,
    # just some basic info about each match. In fact, I think you only need
    # match_id, home_team_name and away_team_name from this frame.
    data.drop(columns=["possession_team_name", "xT"]).drop_duplicates(subset="match_id").set_index("match_id"),
    # Extract the xT for the home team
    xT.xs(True).rename("homext"),
    # Extract the xT for the away team
    xT.xs(False).rename("awayxt"),
], axis=1)
On my computer, the code above took 52 ms versus roughly 8 minutes for the original.
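Not part of the loops, but since you mention numpy vectorisation: the two row-wise .apply lookups that build start_zone_value and end_zone_value can also be replaced with fancy indexing. A sketch, assuming none of the bin columns contain NaN (rows with missing coordinates would have to be dropped or filled first); xT_grid stands for the 8x12 numpy array parsed from the string near the top of your script, renamed here so it doesn't clash with the grouped xT series above.

# xT_grid: the 8x12 numpy grid of zone values from the question
y1 = data['y1_bin'].to_numpy(dtype=int)
x1 = data['x1_bin'].to_numpy(dtype=int)
y2 = data['y2_bin'].to_numpy(dtype=int)
x2 = data['x2_bin'].to_numpy(dtype=int)

# One vectorised lookup per column instead of a Python call per row
data['start_zone_value'] = xT_grid[y1, x1]
data['end_zone_value'] = xT_grid[y2, x2]
data['xT'] = data['end_zone_value'] - data['start_zone_value']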