I am trying to merge 7 different data frames on the same column (accident_no), but some of the frames have more rows because accident_no is duplicated in them. For example,
table 1 (Accident) contains 200 accident_no values (all unique) and table 3 contains 196 (all unique), but table 4 (Person) contains 400 accident_no values with some duplicates, because several passengers may have been involved in the same crash; those rows share an accident_no and the information is still useful for analysis.
The problem I am facing is that I have tried concat, join, and merge, but the result always grows to the highest number of rows, and I am getting more than 400 rows.
So far I have tried the methods below:
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3, df5, df6, df7]
df_final = reduce(lambda left, right: pd.merge(left, right, on='ACCIDENT_NO', how='left'), dfs)
AND
dfs = [df.set_index(['ACCIDENT_NO']) for df in [df1, df2, df3, df4, df5, df6, df7]]
print(pd.concat(dfs, axis=1).reset_index())
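To illustrate, here is a toy version with made-up frames (not my real data or column values) that shows the same behaviour:

import pandas as pd
from functools import reduce

# toy stand-ins for the real tables (values made up)
accident = pd.DataFrame({"ACCIDENT_NO": [1, 2, 3], "SEVERITY": ["low", "high", "low"]})
person   = pd.DataFrame({"ACCIDENT_NO": [1, 1, 2, 3, 3], "AGE": [30, 7, 55, 41, 12]})                  # repeated keys
vehicle  = pd.DataFrame({"ACCIDENT_NO": [1, 1, 2, 3], "VEHICLE_TYPE": ["car", "bike", "car", "bus"]})  # repeated keys

merged = reduce(lambda left, right: pd.merge(left, right, on="ACCIDENT_NO", how="left"),
                [accident, person, vehicle])
print(len(person), len(merged))  # 5 7 -- the result has more rows than the largest single table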
So, is it possible to end up with more rows than 400, or am I doing something wrong?
Thanks
CodePudding user response:
You can try:
table1 = table1.merge(table2, on='accident_no', how='left')
and do the same for the other tables.
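For example, a rough sketch of chaining it over all of your frames and printing the shape after each step, so you can see which table makes the row count jump (df1 ... df7 and the ACCIDENT_NO column name are taken from your code; adjust as needed):

merged = df1  # start from df1, assuming it is the Accident table with unique keys
for name, df in [("df2", df2), ("df3", df3), ("df4", df4),
                 ("df5", df5), ("df6", df6), ("df7", df7)]:
    merged = merged.merge(df, on="ACCIDENT_NO", how="left")
    print(name, merged.shape)  # the row count grows at every table that repeats ACCIDENT_NO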
CodePudding user response:
Consider creating a person count column with groupby().cumcount()
in each data frame, then concatenate on person and accident identifiers:
dfs = [
    df.assign(
        PERSON_NO=lambda x: x.groupby(["ACCIDENT_NO"]).cumcount().add(1)
    ).set_index(["PERSON_NO", "ACCIDENT_NO"])
    for df in [df1, df2, df3, df4, df5, df6, df7]
]

final_df = pd.concat(dfs, axis=1).reset_index()
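For example, with two small made-up frames (not the real tables), the counter lines up the repeated accident rows side by side instead of multiplying them:

import pandas as pd

person  = pd.DataFrame({"ACCIDENT_NO": [1, 1, 2], "AGE": [30, 7, 55]})
vehicle = pd.DataFrame({"ACCIDENT_NO": [1, 1, 2], "VEHICLE_TYPE": ["car", "bike", "car"]})

dfs = [
    df.assign(PERSON_NO=lambda x: x.groupby(["ACCIDENT_NO"]).cumcount().add(1))
      .set_index(["PERSON_NO", "ACCIDENT_NO"])
    for df in [person, vehicle]
]
print(pd.concat(dfs, axis=1).reset_index())
# output (roughly):
#    PERSON_NO  ACCIDENT_NO  AGE VEHICLE_TYPE
# 0          1            1   30          car
# 1          2            1    7         bike
# 2          1            2   55          car

Note that this pairs rows within an accident purely by their order in each frame, and a table with a single row per accident will only align with PERSON_NO 1 of that accident, leaving NaN on the remaining person rows.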