compare two dataframe with duplicates and pivot them-CodePudding

My input:

df1 = pd.DataFrame({'frame':[ 1,1,1,2,3],
                    'label':['GO','PL','ICV','CL','AO']}) 
df2 = pd.DataFrame({'frame':[ 1, 1, 2, 3, 4],
                    'label':['ICV','GO', 'CL','TI','PI']})
df_c = pd.concat([df1,df2])

I trying compare two df, frame by frame, and check if label in df1 existing in same frame in df2. And make some calucation with result (pivot for example)

That my code:

m_df = df1.merge(df2,on=['frame'],how='outer' )
m_df['cross']=m_df.apply(lambda row: 'Matched' 
                                            if row['label_x']==row['label_y'] 
                                            else 'Mismatched', axis='columns')
pv_m = pd.pivot_table(m_df,columns='cross',index='label_x',values='frame', aggfunc=pd.Series.nunique,margins=True, fill_value=0)

but this creates a problem:

in column All wrong values. as you see GO and ICV I expect in total 2, not 1 as now.
I think maybe there is a smarter way to compare df and pivot them?

CodePudding user response：

You can use isin() function like this:

df3 =df1[df1.label.isin(df2.label)]

CodePudding user response：

You got the unexpected result on margin total because the margin is making use the same function passed to aggfunc (i.e. pd.Series.nunique in this case) for its calculation and the values of Matched and Mismatched in these 2 rows are both the same as 1 (hence only one unique value of 1). (You are currently getting the unique count of frame id's)

Probably, you can achieve more or less what you want by taking the count on them (including margin, Matched and Mismatched) instead of the unique count of frame id's, by using pd.Series.count instead in the last line of codes:

pv_m = pd.pivot_table(m_df,columns='cross',index='label_x',values='frame', aggfunc=pd.Series.count, margins=True, fill_value=0)

Result

cross    Matched  Mismatched  All
label_x                          
AO             0           1    1
CL             1           0    1
GO             1           1    2
ICV            1           1    2
PL             0           2    2
All            3           5    8