Home > other >  Transform multiple rows of data into one based on multiple keys in pandas
Transform multiple rows of data into one based on multiple keys in pandas

Time:07-31

I have a large CSV file of sports data and I need to transform the data so that teams with the same game_id are on the same row and create new columns based on the homeAway column and existing columns. Is there a way to do this wih Pandas?

Existing format:

game_id     school      conference        homeAway  points
332410041   Connecticut American Athletic home      18
332410041   Towson      CAA               away      33

Desired format:

game_id     home_school home_conference   home_points  away_school  away_conference away_points
332410041   Connecticut American Athletic   18         Towson            CAA           33

CodePudding user response:

One way to solve this is to convert the table into a Pandas dataframe. Filter the main table by 'homeaway', to create 'home' and 'away' dataframes. The columns in the 'away' table are relabelled, and original column of the key is preserved. We then run a join to both to produce the desired output.

import pandas as pd
data = {'game_id': [332410041, 332410041],
        'school': ['Connecticut', 'Towson'],
        'conference':['American Athletic', 'CAA'],
        'homeAway': ['home', 'away'],
        'points': [18, 33]
}
df = pd.DataFrame(data)

home = df[df['homeAway'] == 'home']
del home['homeAway']

away = df[df['homeAway'] == 'away']
del away['homeAway']
away.columns = ['game_id', 'away_school', 'away_conference', 'away_points']

home.merge(away)

CodePudding user response:

  1. Create two dataframes selected by the unique values in the 'homeAway' column, 'home' and 'away', using Boolean indexing.
  2. Drop the obsolete 'homeAway' column
  3. Rename the appropriate columns with a 'home_', and 'away_' prefix.
import pandas as pd

# test dataframe
data = {'game_id': [332410041, 332410041, 662410041, 662410041, 772410041, 772410041],
        'school': ['Connecticut', 'Towson', 'NY', 'CA', 'FL', 'AL'],
        'conference': ['American Athletic', 'CAA', 'a', 'b', 'c', 'd'],
        'homeAway': ['home', 'away', 'home', 'away', 'home', 'away'], 'points': [18, 33, 1, 2, 3, 4]}
df = pd.DataFrame(data)

# create list of dataframes
dfl = [(df[df.homeAway.eq(loc)]
        .drop('homeAway', axis=1)
        .rename({'school': f'{loc}_school',
                 'conference': f'{loc}_conference',
                 'points': f'{loc}_points'}, axis=1))
       for loc in df.homeAway.unique()]

# combine the dataframes
df_new = pd.merge(dfl[0], dfl[1])

# display(df_new)
     game_id  home_school    home_conference  home_points away_school away_conference  away_points
0  332410041  Connecticut  American Athletic           18      Towson             CAA           33
1  662410041           NY                  a            1          CA               b            2
2  772410041           FL                  c            3          AL               d            4
  • Related