I need to merge two columns from a pandas series dataframe together on the last 4 digits of the first column pack_number
.
I currently have 2 dataframes with a different number of columns.
So far, I thought about extracting the last 4 digits of the ROOT_VIN
but I'm not sure how to proceed with column matching and merging these tables together.
Ideally, I'd like to merge the dataframe including pack_number
(which includes 6 other columns with different values) into the dataframe including ROOT_VIN
which also has other columns with different lengths. The goal is to include the pack_number
and match it with its ROOT_VIN
.
After merging, i now have 108k rows (matching the 108k rows for df2) and many of the rows are duplicated because df1 only had 451 rows.
CodePudding user response:
If I understand right then you have 2 data frames with a lot of columns.
# df1 # pack_number Table
# df2 # ROOT_VIN Table
# @BeRT2me solution
df2['pack_number'] = df2['ROOT_VIN'].str[-4:]
# joining the 2 dataframes
# we add '_remove' to the duplicate columns names
df3 = df1.merge(df2, how='inner', on='pack_number', suffixes=('', '_remove'))
# remove the columns that end with '_remove'
df3.drop(
[col_name for col_name in df_3.columns if '_remove' in col_name],
axis=1, inplace=True
)