I need to compare two datasets:
DF1
Subj 1 2 3
0 Biotech Cell culture Bioinfo Immunology
1 Zoology Cell culture Immunology NaN
2 Math Trigonometry Algebra NaN
3 Microbio Biotech NaN NaN
4 Physics Optics NaN NaN
DF2
Subj 1 2
0 Biotech Bioinfo Immunology
1 Zoology Immunology Botany
2 Microbio NaN NaN
3 Physics Optics Quantumphy
4 Math Trigonometry NaN
How I want my result dataframe:
Subj 1 2
0 Biotech Bioinfo Immunology
1 Zoology Immunology NaN
2 Math Trigonometry NaN
3 Physics Optics NaN
I can't check row by row as the datasets are huge. The number of columns varies for both datasets, but rows are the same in number. Since the order of the row elements also vary, I can't simply use merge(). I tried compare function, but it either removes all common elements or forms a dataframe containing both. I can't seem to pick out just the common elements.
CodePudding user response:
You can match columns and then set the subject column as an index while merging the dataframes:
match=df2.columns.intersection(df1.columns).tolist()
df2.merge(df1,on=match, how='left').reindex(df2.columns,axis=1).set_index('Subj').dropna(how='all')
which returns:
1 2
Subj
Biotech Bioinfo Immunology
Zoology Immunology NaN
Math Trigonometry NaN
Physics Optics NaN
CodePudding user response:
here is one way to do it
Understanding: number of column varies and and values in two DF are not under same column
# Stack both the DFs, after setting Subj as index
# this results in changing a wide format to long format
# concat the two DF to forma new DF
df3=pd.concat([df.set_index('Subj').stack().reset_index().rename(columns={0:'val'}),
df2.set_index('Subj').stack().reset_index().rename(columns={0:'val'})],
).reset_index()
# to find the same topic under a subject if it exists in two DFs
# the join will have duplicate rows
# so find the duplicated rows for Subj and Topic (val column)
# group the duplicated rows and aggregate to a comma separated values
# finally split on comma to create new columns
out=(df3[df3.duplicated(subset=['Subj','val'])]
.groupby('Subj')['val']
.agg(','.join)
.str
.split(',',expand=True).reset_index())
out
Subj 0 1
0 Biotech Bioinfo Immunology
1 Math Trigonometry None
2 Physics Optics None
3 Zoology Immunology None