i scraped a html table from a nba game as an pandas dataframe.
import pandas as pd
url = 'https://www.basketball-reference.com/boxscores/pbp/200911060GSW.html'
dfs = pd.read_html(url)
df = dfs[0]
df.rename(columns={'Unnamed: 2_level_1': 'PM1', 'Unnamed: 4_level_1': 'PM2'}, inplace=True)
df
i have the column "time" which starts at 12:00.0 and is descending to 0:00.0 and this for every quarter. i want the time as overall time so that it begins at 48:00.0 and is descending.
my approach: overall_time(i) = overall_time(i-1) - (quarter_time(i-1) - quarter_time(i)) e.g. 48:00.0 - (12:00.0 - 11:46.0) = 47:46.0 for the first row of my dataframe
i think this should be working but i am struggling to implement this in python. maybe someone can help me with this
CodePudding user response:
There is probably a better way, but I felt I needed to converting from Time 'string' format like 11:30 which is hard to subtract, to 11.5 and then back again. Then a bit of fussing with formatting
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
url = 'https://www.basketball-reference.com/boxscores/pbp/200911060GSW.html'
dfs = pd.read_html(url)
df = dfs[0]
df.rename(columns={'Unnamed: 2_level_1': 'PM1', 'Unnamed: 4_level_1': 'PM2'}, inplace=True)
df.columns = df.columns.droplevel() #columns currently multiindex, you don't need 1st Q, drop it
df = df[df['Time'].str.contains(':')] #only include rows with a real 'Time' that contains a colon, excludes headers
#Identify which rows signify the start of a new quarter
#has to have 12 minutes of time and text of 'Start of...' in the 'Score' column
quarter_start_rows = df['Time'].eq('12:00.0') & df['Score'].str.startswith('Start of')
#create a new column called quarter w/ 1 at new quarter, 0 otherwise then cumsum
df['Quarter'] = np.where(quarter_start_rows,1,0).cumsum()
#separate the minutes and seconds and make them int and float respectively
df[['Minutes','Seconds']] = df['Time'].str.split(':',expand=True).astype({0:'int',1:'float'})
#represent Q2 11:30 as 11.5 etc so it is easy to add/subtract times
fractional_time = df['Minutes'].add(df['Seconds'].div(60))
#convert from Q2 11:30 (11.5) to 'global time' which would be 35.5
global_fractional_time = fractional_time.add((4-df['Quarter'])*12)
#convert from fractional time back to Minutes and Seconds
minutes = global_fractional_time.astype(int)
seconds = global_fractional_time.sub(minutes).multiply(60).round(1)
#Make a new string column to show the global minutes and seconds more nicely
df['Overall Time'] = minutes.astype(str).str.zfill(2) ':' seconds.astype(str).str.zfill(4)
Output