I have a large CSV file of sports data and I need to transform the data so that teams with the same game_id are on the same row and create new columns based on the homeAway column and existing columns. Is there a way to do this wih Pandas?
Existing format:
game_id school conference homeAway points
332410041 Connecticut American Athletic home 18
332410041 Towson CAA away 33
Desired format:
game_id home_school home_conference home_points away_school away_conference away_points
332410041 Connecticut American Athletic 18 Towson CAA 33
CodePudding user response:
One way to solve this is to convert the table into a Pandas dataframe. Filter the main table by 'homeaway', to create 'home' and 'away' dataframes. The columns in the 'away' table are relabelled, and original column of the key is preserved. We then run a join to both to produce the desired output.
import pandas as pd
data = {'game_id': [332410041, 332410041],
'school': ['Connecticut', 'Towson'],
'conference':['American Athletic', 'CAA'],
'homeAway': ['home', 'away'],
'points': [18, 33]
}
df = pd.DataFrame(data)
home = df[df['homeAway'] == 'home']
del home['homeAway']
away = df[df['homeAway'] == 'away']
del away['homeAway']
away.columns = ['game_id', 'away_school', 'away_conference', 'away_points']
home.merge(away)
CodePudding user response:
- Create two dataframes selected by the unique values in the
'homeAway'
column,'home'
and'away'
, using Boolean indexing. - Drop the obsolete
'homeAway'
column - Rename the appropriate columns with a
'home_'
, and'away_'
prefix.
- This can be done in a for-loop, with each dataframe added to a list, which can be consolidated into a simple list-comprehension.
- Use
pd.merge
to combine the two dataframes on the common'game_id'
column.- See Merge, join, concatenate and compare and Pandas Merging 101 for additional details.
import pandas as pd
# test dataframe
data = {'game_id': [332410041, 332410041, 662410041, 662410041, 772410041, 772410041],
'school': ['Connecticut', 'Towson', 'NY', 'CA', 'FL', 'AL'],
'conference': ['American Athletic', 'CAA', 'a', 'b', 'c', 'd'],
'homeAway': ['home', 'away', 'home', 'away', 'home', 'away'], 'points': [18, 33, 1, 2, 3, 4]}
df = pd.DataFrame(data)
# create list of dataframes
dfl = [(df[df.homeAway.eq(loc)]
.drop('homeAway', axis=1)
.rename({'school': f'{loc}_school',
'conference': f'{loc}_conference',
'points': f'{loc}_points'}, axis=1))
for loc in df.homeAway.unique()]
# combine the dataframes
df_new = pd.merge(dfl[0], dfl[1])
# display(df_new)
game_id home_school home_conference home_points away_school away_conference away_points
0 332410041 Connecticut American Athletic 18 Towson CAA 33
1 662410041 NY a 1 CA b 2
2 772410041 FL c 3 AL d 4