I currently have two DataFrames. The first has a list of masses in the column 'mass_pos':
entry mass Precursor Monoisotopic mass_pos masses match
0 KGTLP 110 KGTLP 581.69 691.69 True
1 KGTLP 125 KGTLP 581.69 706.69 True
2 KGTLP 133 KGTLP 581.69 714.69 True
3 KGTLP 139 KGTLP 581.69 720.69 True
4 KGTLP 153 KGTLP 581.69 734.69 True
... ... ... ... ... ...
355675 GTKKP 42 GTKKP 596.70 638.70 True
355676 GTKKP 43 GTKKP 596.70 639.70 True
355677 GTKKP 210 GTKKP 596.70 806.70 True
355678 GTKKP 226 GTKKP 596.70 822.70 True
355679 GTKKP 0 GTKKP 596.70 596.70 True
The other DataFrame looks like this:
Mass
0 586.672
1 798.780
2 690.780
3 400.000
4 662.000
As you can see, I used np.isclose to check whether any value in the second DataFrame is within a certain tolerance of the 'mass_pos' value in the first DataFrame, and the resulting boolean is stored in the 'masses match' column of the first df. This is how I did that:
tolerance = tol_in #provides margin of error
match_mass = lambda x: np.any(np.isclose(x, mass_q_sequence['Mass'], atol=tolerance))
df_seq2['masses match'] = df_seq2['mass_pos'].apply(match_mass)
df_seq2 = df_seq2[df_seq2['masses match'] == True] #remove all false rows from df
I have come to realize that I need to calculate a ppm error, which involves finding the error between the 'mass_pos' and 'mass' values, so the simple boolean output no longer suffices. Is there a way to either report the difference between these values, or append to the first df the matched value from the second df that satisfies the boolean?
Essentially I just need to report what value from the second df satisfied the boolean in the first.
CodePudding user response:
If I understand correctly, you want to find, for each 'mass_pos', the closest value in the second DataFrame.
import numpy as np
masses = mass_q_sequence['Mass'].to_numpy()  # a pandas Series has no .reshape, so work on NumPy arrays
mass_pos = df_seq2['mass_pos'].to_numpy()
# using broadcasting and finding indices of closest mass for each mass_pos:
closest_mass_indices = np.argmin(np.abs(masses.reshape(1, -1) - mass_pos.reshape(-1, 1)), axis=1)
df_seq2['closest_mass'] = masses[closest_mass_indices]  # matched value from the second df
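Once the closest value is stored, the difference and the ppm error follow directly. The sketch below is only an illustration: the sample rows, the tolerance of 1.0, and the column names 'mass_diff' and 'ppm_error' are made up, and it assumes 'mass_pos' is the reference (theoretical) mass in the usual ppm formula (observed - theoretical) / theoretical * 1e6. Swap the roles if your convention is the other way around.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data modelled on the frames in the question.
df_seq2 = pd.DataFrame({
    "entry": ["KGTLP", "KGTLP", "GTKKP"],
    "mass_pos": [691.69, 706.69, 596.70],
})
mass_q_sequence = pd.DataFrame({"Mass": [586.672, 798.780, 690.780, 400.000, 662.000]})

tolerance = 1.0  # assumed absolute tolerance, same units as the masses

masses = mass_q_sequence["Mass"].to_numpy()
mass_pos = df_seq2["mass_pos"].to_numpy()

# Pairwise |mass_pos_i - Mass_j|, shape (len(df_seq2), len(mass_q_sequence)).
diffs = np.abs(mass_pos[:, None] - masses[None, :])
closest = diffs.argmin(axis=1)  # column index of the nearest Mass for each row

df_seq2["closest_mass"] = masses[closest]
df_seq2["mass_diff"] = df_seq2["closest_mass"] - df_seq2["mass_pos"]
# ppm error, assuming mass_pos is the reference value.
df_seq2["ppm_error"] = df_seq2["mass_diff"] / df_seq2["mass_pos"] * 1e6

# Keep only rows whose nearest Mass lies within the absolute tolerance.
matched = df_seq2[df_seq2["mass_diff"].abs() <= tolerance]
print(matched)
```

Note that the filter here is a pure absolute threshold, whereas np.isclose also adds a relative term (rtol, default 1e-05), so results right at the boundary can differ slightly from the original boolean column. The pairwise matrix has one row per 'mass_pos' and one column per 'Mass', so memory use grows with the product of the two lengths.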