Mapping team changes over time in Python-CodePudding

I have the following dataframe in Python that lists team lineage for a Formula 1 team in that it identifies when teams change name and what they change to:

import pandas as pd

teams = pd.DataFrame({'type': {0: 'new', 1: 'change', 2: 'change', 3: 'change'},
 'orig_event_id': {0: nan, 1: '2009-01-Q', 2: '2012-01-Q', 3: '2017-01-Q'},
 'orig_team_id': {0: nan,
  1: 'Tyrrell_Racing_Organisation',
  2: 'Mercedes_AMG_Petronas_F1_Team',
  3: 'Mercedes_AMG_Petronas_Motorsport'},
 'event_id': {0: '1970-01-Q', 1: '2010-01-Q', 2: '2017-01-Q', 3: '2020-01-Q'},
 'new_team_id': {0: 'Tyrrell_Racing_Organisation',
  1: 'Mercedes_AMG_Petronas_F1_Team',
  2: 'Mercedes_AMG_Petronas_Motorsport',
  3: 'Mercedes_AMG_Petronas_F1_Team'}})

The 'orig_event_id' is the event in which the team name changes and the 'event_id' is how long the name lasts to, if that helps.

I also have this dataframe of drivers and races that I would like to change the data of:

driver = pd.DataFrame({'index': {0: 22258, 1: 22279, 2: 22300, 3: 22317, 4: 22337},
 'driver_id': {0: 'Lewis_Hamilton',
  1: 'Lewis_Hamilton',
  2: 'Lewis_Hamilton',
  3: 'Lewis_Hamilton',
  4: 'Lewis_Hamilton'},
 'team_id': {0: 'Mercedes_AMG_Petronas_Motorsport',
  1: 'Mercedes_AMG_Petronas_Motorsport',
  2: 'Mercedes_AMG_Petronas_Motorsport',
  3: 'Mercedes_AMG_Petronas_Motorsport',
  4: 'Mercedes_AMG_Petronas_Motorsport'},
 'event_id': {0: '2018-01-R',
  1: '2018-02-R',
  2: '2018-03-R',
  3: '2018-04-R',
  4: '2018-05-R'},
 'season': {0: 2018, 1: 2018, 2: 2018, 3: 2018, 4: 2018},
 'stage': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}})

So essentially what I want is the team_id in the driver df to change from 'Mercedes_AMG_Petronas_Motorsport' to 'Mercedes_AMG_Petronas_F1_Team' and all of the team_id with 'Tyrrell_Racing_Organisation' to also change to 'Mercedes_AMG_Petronas_F1_Team' so that they're consistent from year-to-year.

Is there an easy way to do this? Let me know and appreciate any help!

CodePudding user response：

You can try

driver['team_id'] = driver['team_id'].map(dict(zip(teams['orig_team_id'], teams['new_team_id'])))

# Or

driver['team_id'] = driver['team_id'].map(teams.set_index('orig_team_id')['new_team_id'])

print(df)

   index       driver_id                        team_id   event_id  season  stage
0  22258  Lewis_Hamilton  Mercedes_AMG_Petronas_F1_Team  2018-01-R    2018      1
1  22279  Lewis_Hamilton  Mercedes_AMG_Petronas_F1_Team  2018-02-R    2018      2
2  22300  Lewis_Hamilton  Mercedes_AMG_Petronas_F1_Team  2018-03-R    2018      3
3  22317  Lewis_Hamilton  Mercedes_AMG_Petronas_F1_Team  2018-04-R    2018      4
4  22337  Lewis_Hamilton  Mercedes_AMG_Petronas_F1_Team  2018-05-R    2018      5

CodePudding user response：

I can't see a particularly easy or vectorized way to be sure of replacing every team ID with the latest corresponding team ID. You can write a function that iterates through the rows, creating a series that maps the original team IDs to latest team IDs.

def get_latest_team_ids(df):
    # Make a dataframe of team ID changes in date order
    df = df.sort_values("event_id") \
        .loc[df["type"] == "change", ["orig_team_id", "new_team_id"]] \
        .reset_index(drop=True)

    # Replace earlier occurrences of a changed ID (e.g. A => B, B => C becomes A => C, B => C)
    for i, (orig, new) in df[1:].iterrows():
        df.loc[:i-1, "new_team_id"].replace(orig, new, inplace=True)

    # Return it as a series mapping original ID to latest ID
    return df.set_index("orig_team_id")["new_team_id"]

Then, use this series to map the team IDs in the driver dataframe, as in @Ynjxsjmh's answer.

team_mapping = get_latest_team_ids(teams)
driver["latest_team_id"] = driver["team_id"].map(team_mapping)