I am trying to merge 7 different data frames on the same column (accident_no), but some of the frames have more rows because accident_no is duplicated in them. For example,
table 1 (Accident) contains 200 accident_no values (all unique) and table 3 contains 196 (all unique), but table 4 (Person) contains 400 accident_no values with some duplicates, because several passengers may have been involved in the same crash; those rows share an accident_no and the information is still useful for analysis.
The problem I am facing is that I have tried concat, join, and merge, but the result always grows to the highest number of rows, and I am getting more than 400 rows.
So far I have tried the methods below:
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3, df5, df6, df7]
df_final = reduce(lambda left, right: pd.merge(left, right, on='ACCIDENT_NO', how='left'), dfs)
AND
dfs = [df.set_index(['ACCIDENT_NO']) for df in [df1, df2, df3, df4, df5, df6, df7]]
print(pd.concat(dfs, axis=1).reset_index())
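To illustrate, here is a toy version with made-up frames (not my real data or column values) that shows the same behaviour:

import pandas as pd
from functools import reduce

# toy stand-ins for the real tables (values made up)
accident = pd.DataFrame({"ACCIDENT_NO": [1, 2, 3], "SEVERITY": ["low", "high", "low"]})
person   = pd.DataFrame({"ACCIDENT_NO": [1, 1, 2, 3, 3], "AGE": [30, 7, 55, 41, 12]})                  # repeated keys
vehicle  = pd.DataFrame({"ACCIDENT_NO": [1, 1, 2, 3], "VEHICLE_TYPE": ["car", "bike", "car", "bus"]})  # repeated keys

merged = reduce(lambda left, right: pd.merge(left, right, on="ACCIDENT_NO", how="left"),
                [accident, person, vehicle])
print(len(person), len(merged))  # 5 7 -- the result has more rows than the largest single table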
So, is it possible to end up with more rows than 400, or am I doing something wrong?
Thanks
CodePudding user response:
You can try:
table1 = table1.merge(table2, on='accident_no', how='left')
and do the same for the other tables.
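For example, a rough sketch of chaining it over all of your frames and printing the shape after each step, so you can see which table makes the row count jump (df1 ... df7 and the ACCIDENT_NO column name are taken from your code; adjust as needed):

merged = df1  # start from df1, assuming it is the Accident table with unique keys
for name, df in [("df2", df2), ("df3", df3), ("df4", df4),
                 ("df5", df5), ("df6", df6), ("df7", df7)]:
    merged = merged.merge(df, on="ACCIDENT_NO", how="left")
    print(name, merged.shape)  # the row count grows at every table that repeats ACCIDENT_NO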
CodePudding user response:
Consider creating a person count column with groupby().cumcount()
in each data frame, then concatenate on person and accident identifiers:
dfs = [
    df.assign(
        PERSON_NO=lambda x: x.groupby(["ACCIDENT_NO"]).cumcount().add(1)
    ).set_index(["PERSON_NO", "ACCIDENT_NO"])
    for df in [df1, df2, df3, df4, df5, df6, df7]
]

final_df = pd.concat(dfs, axis=1).reset_index()
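For example, with two small made-up frames (not the real tables), the counter lines up the repeated accident rows side by side instead of multiplying them:

import pandas as pd

person  = pd.DataFrame({"ACCIDENT_NO": [1, 1, 2], "AGE": [30, 7, 55]})
vehicle = pd.DataFrame({"ACCIDENT_NO": [1, 1, 2], "VEHICLE_TYPE": ["car", "bike", "car"]})

dfs = [
    df.assign(PERSON_NO=lambda x: x.groupby(["ACCIDENT_NO"]).cumcount().add(1))
      .set_index(["PERSON_NO", "ACCIDENT_NO"])
    for df in [person, vehicle]
]
print(pd.concat(dfs, axis=1).reset_index())
# output (roughly):
#    PERSON_NO  ACCIDENT_NO  AGE VEHICLE_TYPE
# 0          1            1   30          car
# 1          2            1    7         bike
# 2          1            2   55          car

Note that this pairs rows within an accident purely by their order in each frame, and a table with a single row per accident will only align with PERSON_NO 1 of that accident, leaving NaN on the remaining person rows.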