Merge two columns from multiple panda series dataframes based on string matching from two columns wi-CodePudding

I need to merge two columns from a pandas series dataframe together on the last 4 digits of the first column pack_number. I currently have 2 dataframes with a different number of columns.

So far, I thought about extracting the last 4 digits of the ROOT_VIN but I'm not sure how to proceed with column matching and merging these tables together.

Ideally, I'd like to merge the dataframe including pack_number (which includes 6 other columns with different values) into the dataframe including ROOT_VIN which also has other columns with different lengths. The goal is to include the pack_number and match it with its ROOT_VIN.

After merging, i now have 108k rows (matching the 108k rows for df2) and many of the rows are duplicated because df1 only had 451 rows.

CodePudding user response：

If I understand right then you have 2 data frames with a lot of columns.

# df1 # pack_number Table
# df2 # ROOT_VIN Table

# @BeRT2me solution
df2['pack_number'] = df2['ROOT_VIN'].str[-4:]

# joining the 2 dataframes
# we add '_remove' to the duplicate columns names
df3 = df1.merge(df2, how='inner', on='pack_number', suffixes=('', '_remove'))

# remove the columns that end with '_remove'
df3.drop(
    [col_name for col_name in df_3.columns if '_remove' in col_name],
    axis=1, inplace=True
)