Pandas dataframe compare columns of different sizes in different dataframes-CodePudding

I'm trying to compare two dataframes:

df1:

       entry  mass  Precursor  mass_pos
0     KGTLPK   128    642.780   770.780
1     KGTLPK    48    642.780   690.780
2     KGTLPK   112    642.780   754.780
3     KGTLPK    32    642.780   674.780
4     KGTLPK   156    642.780   798.780

df2:

My goal is to find any matches of df2 'mass' with df1 'mass_pos'.

I really liked this:

df1['masses match'] = np.where(df2['Mass'] == df1['mass_pos'], 'True', 'False')

But this throws a value error:

ValueError: Can only compare identically-labeled Series objects

I think that this is because these dataframes have a different number of rows. Is there a way to overcome this?

CodePudding user response：

Use np.isclose:

tolerance = 1e-03
match_mass = lambda x: np.any(np.isclose(x, df2['Mass'], atol=tolerance))
df1['masses match'] = df1['mass_pos'].apply(match_mass)
print(df1)

# Output:
    entry  mass  Precursor  mass_pos  masses match
0  KGTLPK   128     642.78    770.78         False
1  KGTLPK    48     642.78    690.78          True
2  KGTLPK   112     642.78    754.78         False
3  KGTLPK    32     642.78    674.78         False
4  KGTLPK   156     642.78    798.78          True

CodePudding user response：

Use merge to check exist item.

Example:

df=df1.merge(df2,left_on="mass_pos",right_on="Mass")

Code:

import pandas as pd

df1 = pd.DataFrame({'entry': ['KGTLPK','KGTLPK','KGTLPK','KGTLPK','KGTLPK' ],
                    'mass': [128 ,48, 112 , 32, 156],
                    'Precursor': [642.780,642.780,642.780,642.780,642.780],
                    'mass_pos': [ 770.780, 690.780, 754.780, 674.780, 798.780]
                    })

df2 = pd.DataFrame({'Mass': [586.672, 798.780, 690.780, 400.000 ]})

df=df1.merge(df2,left_on="mass_pos",right_on="Mass")

print(df)

Output :

    entry  mass  Precursor  mass_pos    Mass
0  KGTLPK    48     642.78    690.78  690.78
1  KGTLPK   156     642.78    798.78  798.78

To get the expected result, i did like this :

import pandas as pd

df1 = pd.DataFrame({'entry': ['KGTLPK','KGTLPK','KGTLPK','KGTLPK','KGTLPK' ],
                    'mass': [128 ,48, 112 , 32, 156],
                    'Precursor': [642.780,642.780,642.780,642.780,642.780],
                    'mass_pos': [ 770.780, 690.780, 754.780, 674.780, 798.780]
                    })

df2 = pd.DataFrame({'Mass': [586.672, 798.780, 690.780, 400.000 ]})

DF3 = df1[['mass_pos']].fillna('None')
DF4 = df2[['Mass']].fillna('None')
df1['Merge_Result'] = 'F'
for i in range(DF3.shape[0]):
    for k in range(DF4.shape[0]):        
        if list(DF3.iloc[i]) == list(DF4.iloc[k]):
            df1['Merge_Result'][i] = 'T'
            
print(df1)

Output :

    entry  mass  Precursor  mass_pos Merge_Result
0  KGTLPK   128     642.78    770.78            F
1  KGTLPK    48     642.78    690.78            T
2  KGTLPK   112     642.78    754.78            F
3  KGTLPK    32     642.78    674.78            F
4  KGTLPK   156     642.78    798.78            T