R function for comparing data within a dataframe

> head(peaks)
       m.z Height    RT
1  84.9594 358909 0.219
2 214.9169 111512 0.223
3  56.9418 168197 0.261
4 201.8865  26736 0.352
5 122.9683  20625 0.465
6  84.9594 343573 0.854

I have a dataframe in which I want to search the m.z column for any pairs of values that differ by between 4 and 6. The column has over 700 data points, so searching for pairs for each data point separately is not an option. Is there a function or package that can analyze the whole column for pairs at once?

So far I have only found functions for comparing two dataframes.

CodePudding user response：

If I understand you correctly, you wish to find any pairs of elements in the m.z column whose values differ by between 4 and 6. You can use dist to get all such pairwise differences at once:

m <- as.matrix(dist(peaks$m.z, diag = TRUE, upper = TRUE))

m
#>          1        2        3        4        5        6
#> 1   0.0000 129.9575  28.0176 116.9271  38.0089   0.0000
#> 2 129.9575   0.0000 157.9751  13.0304  91.9486 129.9575
#> 3  28.0176 157.9751   0.0000 144.9447  66.0265  28.0176
#> 4 116.9271  13.0304 144.9447   0.0000  78.9182 116.9271
#> 5  38.0089  91.9486  66.0265  78.9182   0.0000  38.0089
#> 6   0.0000 129.9575  28.0176 116.9271  38.0089   0.0000

To find which distances are between 4 and 6, we would do:

which(m > 4 & m < 6, arr.ind = TRUE)
#> row  col

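The empty result means that no two values in your sample are between 4 and 6 units apart. If such a pair did exist, you would get a two-column matrix in which each row gives the row and column indices of a pair meeting your criteria. Because the distance matrix is symmetric, each pair is listed twice, once as (i, j) and once as (j, i).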

We can see this by changing the first element of m.z to 61, so that it is between 4 and 6 away from the value in the 3rd row of your data frame:

peaks$m.z[1] <- 61

m <- as.matrix(dist(peaks$m.z, diag = TRUE, upper = TRUE))

which(m > 4 & m < 6, arr.ind = TRUE)
#>   row col
#> 3   3   1
#> 1   1   3
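If you would rather see each pair only once, one option is to restrict the test to the upper triangle of the matrix. The following is a minimal sketch under that assumption; the names mz, hits and pairs_once are only illustrative, and mz holds the six sample values with the first one changed to 61 as above.

```r
# Sample m.z values from the question, with the first one changed to 61
mz <- c(61, 214.9169, 56.9418, 201.8865, 122.9683, 84.9594)

# Full symmetric matrix of absolute pairwise differences
m <- as.matrix(dist(mz))

# upper.tri(m) is TRUE only above the diagonal, so each pair is kept once
hits <- which(m > 4 & m < 6 & upper.tri(m), arr.ind = TRUE)

# Attach the two m.z values and their difference to each pair
pairs_once <- data.frame(
  row        = hits[, "row"],
  col        = hits[, "col"],
  mz_row     = mz[hits[, "row"]],
  mz_col     = mz[hits[, "col"]],
  difference = m[hits]
)
pairs_once
```

Here the only pair reported is rows 1 and 3. Use >= and <= instead of > and < if the bounds 4 and 6 themselves should count as matches.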

CodePudding user response：

This approach yields a data frame that pairs rows by their index (the row number). It's a fair bit slower than Allan's solution, but may be worth it if you prefer a data frame over the matrix as the result (purely a stylistic matter). I've added a row of data to your example to illustrate what the output looks like when pairings exist.

peaks <- data.frame(m.z = c(84.9594, 214.9169, 56.9418, 201.8865, 122.9683, 84.9594, 89.000), 
                    Height = c(358909, 111512, 168197, 26736, 20625, 343573, 343577), 
                    RT = c(0.219, 0.223, 0.261, 0.352, 0.465, 0.854, 0.900))

# Assign an index to each row
peaks$index <- seq_len(nrow(peaks))

# For each row, get the absolute pairwise differences with all rows
# Create a logical value indicating whether the difference is between 4 and 6
# Also add a 'reference index' indicating which row is being compared to the others
peaks_compare <- 
  lapply(seq_len(nrow(peaks)), 
         function(i){
           delta <- abs(peaks$m.z[i] - peaks$m.z)  # avoid masking base::diff
           peaks$ref_index <- rep(i, nrow(peaks))
           peaks$in_range <- delta >= 4 & delta <= 6
           peaks
         })

# Put all the data frames together and filter down to just those rows where
# the difference is in the target range.
peaks_compare <- do.call("rbind", peaks_compare)
peaks_compare <- peaks_compare[peaks_compare$in_range, ]

# Order the data frame by reference index, then by index
peaks_compare <- peaks_compare[order(peaks_compare$ref_index, 
                                     peaks_compare$index), ]

peaks_compare
#>        m.z Height    RT index ref_index in_range
#> 7  89.0000 343577 0.900     7         1     TRUE
#> 42 89.0000 343577 0.900     7         6     TRUE
#> 43 84.9594 358909 0.219     1         7     TRUE
#> 48 84.9594 343573 0.854     6         7     TRUE
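Note that each pairing appears twice in this output, once from each row's point of view. If one line per pair is enough, a possible alternative (a sketch, not a drop-in replacement for the code above) is to enumerate the unordered pairs of row numbers with combn and compare them in one vectorised step. The names idx, pairs, pairs_in_range, ref_m.z and difference are made up for this illustration.

```r
peaks <- data.frame(
  m.z    = c(84.9594, 214.9169, 56.9418, 201.8865, 122.9683, 84.9594, 89.000),
  Height = c(358909, 111512, 168197, 26736, 20625, 343573, 343577),
  RT     = c(0.219, 0.223, 0.261, 0.352, 0.465, 0.854, 0.900)
)

# All unordered pairs of row numbers; each column of idx is one pair (i < j)
idx <- combn(nrow(peaks), 2)

# One row per pair, with the m.z value of both members
pairs <- data.frame(
  ref_index = idx[1, ],
  index     = idx[2, ],
  ref_m.z   = peaks$m.z[idx[1, ]],
  m.z       = peaks$m.z[idx[2, ]]
)
pairs$difference <- abs(pairs$ref_m.z - pairs$m.z)

# Keep only the pairs whose difference lies between 4 and 6 (inclusive)
pairs_in_range <- pairs[pairs$difference >= 4 & pairs$difference <= 6, ]
pairs_in_range
```

For the example data this keeps two pairs, rows 1 and 7 and rows 6 and 7. With 700 rows, combn produces roughly 245,000 pairs, which should still be manageable in memory.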