Reporting closest value using np.isclose in pandas dataframe

Time:12-25

I currently have two DataFrames, one which has a list of masses (listed as column 'mass_pos'):

        entry  mass Precursor  Monoisotopic  mass_pos  masses match
0       KGTLP   110     KGTLP        581.69    691.69          True
1       KGTLP   125     KGTLP        581.69    706.69          True
2       KGTLP   133     KGTLP        581.69    714.69          True
3       KGTLP   139     KGTLP        581.69    720.69          True
4       KGTLP   153     KGTLP        581.69    734.69          True
      ...   ...       ...           ...       ...           ...
355675  GTKKP    42     GTKKP        596.70    638.70          True
355676  GTKKP    43     GTKKP        596.70    639.70          True
355677  GTKKP   210     GTKKP        596.70    806.70          True
355678  GTKKP   226     GTKKP        596.70    822.70          True
355679  GTKKP     0     GTKKP        596.70    596.70          True

The other DataFrame looks like this:

      Mass
0  586.672
1  798.780
2  690.780
3  400.000
4  662.000

As the 'masses match' column shows, I used np.isclose to check whether any value in the second DataFrame is within a certain tolerance of the 'mass_pos' value in the first DataFrame, and appended the resulting boolean to the first DataFrame. This is how I did that:

tolerance = tol_in #provides margin of error
match_mass = lambda x: np.any(np.isclose(x, mass_q_sequence['Mass'], atol=tolerance))
df_seq2['masses match'] = df_seq2['mass_pos'].apply(match_mass)
df_seq2 = df_seq2[df_seq2['masses match'] == True] #remove all false rows from df

I have come to realize that I need to calculate a ppm error, which involves finding the difference between the 'mass_pos' value and the matching 'Mass' value, so the simple boolean output no longer suffices. Is there a way to either report the difference between these values, or append to the first DataFrame the value from the second DataFrame that satisfied the boolean?

Essentially I just need to report what value from the second df satisfied the boolean in the first.

Answer:

If I understand correctly, you just want to find, for each 'mass_pos', the closest value in the second DataFrame.

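import numpy as np

# convert to NumPy arrays: a pandas Series has no .reshape, and plain arrays
# avoid index alignment when the result is assigned back
masses = mass_q_sequence['Mass'].to_numpy()
mass_pos = df_seq2['mass_pos'].to_numpy()
# using broadcasting and finding indices of closest mass for each mass_pos:
closest_mass_indices = np.argmin(np.abs(masses.reshape(1, -1) - mass_pos.reshape(-1, 1)), axis=1)
df_seq2['closest_mass'] = masses[closest_mass_indices]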
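
Since the question ultimately asks for a ppm error, the closest-value lookup can be extended to report the difference and the ppm error as well. The following is only a minimal sketch on toy data: the helper name add_closest_mass, the new column names and the tolerance value are made up for illustration, and it assumes that 'mass_pos' is the theoretical mass the ppm error should be computed against (swap the denominator if your convention differs).

```python
import numpy as np
import pandas as pd

def add_closest_mass(df, query_masses, col="mass_pos"):
    """Return a copy of df with the closest query mass, the difference and the ppm error."""
    masses = np.asarray(query_masses, dtype=float)
    mass_pos = df[col].to_numpy(dtype=float)
    # (n_rows, n_masses) matrix of absolute differences via broadcasting
    diffs = np.abs(mass_pos[:, None] - masses[None, :])
    closest = masses[np.argmin(diffs, axis=1)]
    out = df.copy()
    out["closest_mass"] = closest
    out["mass_diff"] = closest - mass_pos
    # ppm error relative to mass_pos (assumed here to be the theoretical mass)
    out["ppm_error"] = out["mass_diff"] / mass_pos * 1e6
    return out

# Toy data shaped like the frames in the question
df_seq2 = pd.DataFrame({"entry": ["KGTLP", "KGTLP", "GTKKP"],
                        "mass_pos": [691.69, 706.69, 596.70]})
mass_q_sequence = pd.DataFrame({"Mass": [586.672, 798.780, 690.780, 400.000, 662.000]})

result = add_closest_mass(df_seq2, mass_q_sequence["Mass"])

# Keep only rows whose closest mass lies within an absolute tolerance
tolerance = 1.0  # hypothetical value, stands in for tol_in
matched = result[result["mass_diff"].abs() <= tolerance]
```

Filtering on the absolute difference takes the place of the np.isclose check, so the boolean column is no longer needed. Note that the intermediate matrix has one entry per (row, query mass) pair, which is fine for a handful of query masses but can use a lot of memory if the second DataFrame is large.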