> head(peaks)
m.z Height RT
1 84.9594 358909 0.219
2 214.9169 111512 0.223
3 56.9418 168197 0.261
4 201.8865 26736 0.352
5 122.9683 20625 0.465
6 84.9594 343573 0.854
I have a data frame in which I want to search the m.z column for any pairs of values that differ by between 4 and 6. The column is over 700 data points long, so searching for pairs one data point at a time is not an option. Is there a function or package that can analyze the whole column for pairs at once?
So far I have only found functions for comparing two data frames.
CodePudding user response:
If I understand you correctly, you wish to find any pairs of elements in the m.z column whose values differ by between 4 and 6. You can use dist to get all such pairwise differences:
m <- as.matrix(dist(peaks$m.z, diag = TRUE, upper = TRUE))
m
#> 1 2 3 4 5 6
#> 1 0.0000 129.9575 28.0176 116.9271 38.0089 0.0000
#> 2 129.9575 0.0000 157.9751 13.0304 91.9486 129.9575
#> 3 28.0176 157.9751 0.0000 144.9447 66.0265 28.0176
#> 4 116.9271 13.0304 144.9447 0.0000 78.9182 116.9271
#> 5 38.0089 91.9486 66.0265 78.9182 0.0000 38.0089
#> 6 0.0000 129.9575 28.0176 116.9271 38.0089 0.0000
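For a single numeric column, the Euclidean distance that dist computes reduces to the absolute difference between two values. The following is a minimal sketch of that equivalence, using the six m.z values from the question; the names mz, m_dist and m_outer are illustrative only and not part of the answer above.

```r
# The six m.z values from the question's head(peaks) output
mz <- c(84.9594, 214.9169, 56.9418, 201.8865, 122.9683, 84.9594)

# dist() on one column gives Euclidean distance, i.e. |x_i - x_j|
m_dist <- as.matrix(dist(mz))

# The same matrix built directly from all pairwise differences
m_outer <- abs(outer(mz, mz, "-"))

# Compare values only: as.matrix(dist()) carries dimnames, outer() does not
all.equal(unname(m_dist), m_outer)
```

Either form produces an n-by-n matrix, which is small for a column of about 700 values.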
To find which distances are between 4 and 6, we would do:
which(m > 4 & m < 6, arr.ind = TRUE)
#> row col
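The empty result means that no two values in your sample are between 4 and 6 units apart. If such a pair existed, you would get a 2-column array in which each row gives the row and column indices of a pair meeting your criteria. Because the distance matrix is symmetric, each pair is listed twice, once in each order.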
We can see this by changing the first element of m.z to 61, so that it is between 4 and 6 away from the value in the 3rd row of your data frame:
peaks$m.z[1] <- 61
m <- as.matrix(dist(peaks$m.z, diag = TRUE, upper = TRUE))
which(m > 4 & m < 6, arr.ind = TRUE)
#> row col
#> 3 3 1
#> 1 1 3
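If each pair should be reported only once, one option is to restrict the search to the upper triangle of the matrix. The sketch below assumes the same six m.z values with the first changed to 61; the hits and pairs objects and their column names are hypothetical and chosen only for illustration.

```r
# Six m.z values from the question, with the first changed to 61 as above
mz <- c(61, 214.9169, 56.9418, 201.8865, 122.9683, 84.9594)

m <- as.matrix(dist(mz))

# Keep only the upper triangle so each pair is reported once (row < col)
hits <- which(m > 4 & m < 6 & upper.tri(m), arr.ind = TRUE)

# Illustrative summary: both row numbers, both m.z values, and their difference
pairs <- data.frame(
  i    = hits[, "row"],
  j    = hits[, "col"],
  mz_i = mz[hits[, "row"]],
  mz_j = mz[hits[, "col"]],
  diff = m[hits],
  row.names = NULL
)
pairs
```

Indexing m with the two-column hits matrix pulls out the matching differences directly, so nothing has to be recomputed.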
CodePudding user response:
This approach yields a data frame that matches the pairings on indices (the row number). It's a fair bit slower than Allan's solution, but may be worth it if you prefer the resulting structure over the matrix (purely a stylistic matter). I've added a row of data to your example that illustrates what it looks like when pairings exist.
peaks <- data.frame(m.z = c(84.9594, 214.9169, 56.9418, 201.8865, 122.9683, 84.9594, 89.000),
                    Height = c(358909, 111512, 168197, 26736, 20625, 343573, 343577),
                    RT = c(0.219, 0.223, 0.261, 0.352, 0.465, 0.854, 0.900))
# Assign an index to each row
peaks$index <- seq_len(nrow(peaks))
# For each row, get the absolute differences in m.z from every row
# Make a logical value indicating whether the difference is between 4 and 6 (inclusive)
# Also add a 'reference index' recording which row is being compared to the others
peaks_compare <-
  lapply(seq_len(nrow(peaks)),
         function(i){
           diff <- abs(peaks$m.z[i] - peaks$m.z)
           peaks$ref_index <- rep(i, nrow(peaks))
           peaks$in_range <- diff >= 4 & diff <= 6
           peaks
         })
# put all the data frames together and filter down to just those where
# the difference is in the target range.
peaks_compare <- do.call("rbind", peaks_compare)
peaks_compare <- peaks_compare[peaks_compare$in_range, ]
# Order the data frame
peaks_compare <- peaks_compare[order(peaks_compare$ref_index,
                                     peaks_compare$index), ]
peaks_compare
#> m.z Height RT index ref_index in_range
#> 7 89.0000 343577 0.900 7 1 TRUE
#> 42 89.0000 343577 0.900 7 6 TRUE
#> 43 84.9594 358909 0.219 1 7 TRUE
#> 48 84.9594 343573 0.854 6 7 TRUE
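Each pairing appears twice in the output above, once from each row's point of view. If a single row per pair is preferred, a possible variant is to enumerate only the index combinations with ref_index < index using combn. This is a sketch under that assumption, not part of the answer's original code; it reuses the seven m.z values from the example, and the idx and pairs names are illustrative.

```r
# The seven m.z values from the extended example above
mz <- c(84.9594, 214.9169, 56.9418, 201.8865, 122.9683, 84.9594, 89.000)

# All index combinations with ref_index < index, so each pair appears once
idx <- t(combn(length(mz), 2))
pairs <- data.frame(ref_index = idx[, 1], index = idx[, 2])

# Absolute m.z difference for each pair, then keep those between 4 and 6
pairs$diff <- abs(mz[pairs$ref_index] - mz[pairs$index])
pairs <- pairs[pairs$diff >= 4 & pairs$diff <= 6, ]
pairs
```

For a column of 700 values this enumerates choose(700, 2) = 244650 combinations, about half the number of comparisons in the full loop.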