Home > OS >  Optimal way to create a column by matching two other columns
Optimal way to create a column by matching two other columns

Time:08-26

The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:

code    name    latitude    longitude

I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:

start_date  start_station_code  end_date    end_station_code    duration_sec

I am looking to add columns that have the name of the start/end stations to the second df by matching the first df "code" and second df "start_station_code" / "end_station_code".

I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:

for j in range(0, len(df_stations)):
    for i in range(0, len(df)):
        if(df_stations['code'][j] == df['start_station_code'][i]):
            df['start_station'][i] = df_stations['name'][j]
        if(df_stations['code'][j] == df['end_station_code'][i]):
            df['end_station'][i] = df_stations['name'][j]

I am looking for a faster method, any help is appreciated. Thank you in advance.

CodePudding user response:

Use merge. If you are familiar with SQL, merge is equivalent to LEFT JOIN:

cols = ["code", "name"]
result = (
    second_df
    .merge(first_df[cols], left_on="start_station_code", right_on="code")
    .merge(first_df[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)

CodePudding user response:

The answer by @Code-Different is very nearly correct. However the columns to be renamed are the name columns not the code columns. For neatness you will likely want to drop the additional code columns that get created by the merges. Using your names for the dataframes df and df_station the code needed to produce df_required is:

cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"name_x": "start_station", "name_y": "end_station"})
    .drop(columns = ['code_x', 'code_y'])
)

As you may notice the merge means that the dataframe acquires duplicate 'code' columns which get suffixed automatically, this is a built in default of the merge command. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.

  • Related