Filter a data frame based on a numeric vector with "tolerance"

I would like to filter a data frame using a numeric vector. I am currently using the code below:

test_data <- exp_data[exp_data$Size_Change %in% vec_data,]

This is what the example data looks like:

dput(exp_data)
structure(list(Name = c("Mark", "Greg", "Tomas", "Morka", "Pekka", 
"Robert", "Tim", "Tom", "Bobby", "Terka"), Mode = c(1, 2, NA, 
4, NA, 3, NA, 1, NA, 3), Change = structure(c(6L, 2L, 4L, 5L, 
7L, 7L, 7L, 8L, 3L, 1L), .Label = c("D[ 58], I[ 12][ 385]", "C[ 58], K[ 1206]", 
"C[ 58], P[ 2074]", "C[ 58], K[ 2172]", "C[ 58], K[ 259]", "C[ 58], K[ 2665]", 
"C[ 58], T[ 385]", "C[ 58], C[ 600]"), class = "factor"), Size = c(1335.261, 
697.356, 1251.603, 920.43, 492.236, 393.991, 492.239, 727.696, 
1218.933, 495.237), Place = c(3L, 4L, 3L, 2L, 4L, 5L, 4L, 3L, 
3L, 4L), Size_Change = c(4004, 2786, 3753, 1840, 1966, 1966, 
1966, 2181, 3655, 1978)), row.names = 2049:2058, class = "data.frame")

and the vector used for filtering:

dput(vec_data)
c(4003, 2785, 954, 1129, 4013, 756, 1852, 2424, 1954, 246, 147, 
234, 562, 1617, 2180, 888, 1176)

I mention tolerance because vec_data is not very precise: I expect a difference of +1/-1 between the numbers, and the code above will not keep rows that differ like that. The difference may also be +12/-12 or +24/-24. Can I somehow take this into account while filtering?

Of course, one solution is probably to do something like (vec_data + 1) / (vec_data - 1) / (vec_data + 12), etc., run several filtering passes, and finally rbind all the outputs, but I am looking for a more "elegant" way. It would also be great to add a column indicating how each row was matched: whether it was an exact number from vec_data or one modified by 1, 12, -24 or whatever. Please take into account that a combination of +1/-1 with any other modification is also possible. The additional column is not necessary if it makes things too complicated.

CodePudding user response：

One option could be (tolerance = 1):

library(dplyr)

exp_data %>%
    filter(sapply(Size_Change, function(x) any(abs(x - vec_data) %in% 0:1)))

  Name Mode           Change     Size Place Size_Change
1 Mark    1 C[ 58], K[ 2665] 1335.261     3        4004
2 Greg    2 C[ 58], K[ 1206]  697.356     4        2786
3  Tom    1  C[ 58], C[ 600]  727.696     3        2181

Tolerance = 14:

exp_data %>%
    filter(sapply(Size_Change, function(x) any(abs(x - vec_data) %in% 0:14)))

    Name Mode           Change     Size Place Size_Change
1   Mark    1 C[ 58], K[ 2665] 1335.261     3        4004
2   Greg    2 C[ 58], K[ 1206]  697.356     4        2786
3  Morka    4  C[ 58], K[ 259]  920.430     2        1840
4  Pekka   NA  C[ 58], T[ 385]  492.236     4        1966
5 Robert    3  C[ 58], T[ 385]  393.991     5        1966
6    Tim   NA  C[ 58], T[ 385]  492.239     4        1966
7    Tom    1  C[ 58], C[ 600]  727.696     3        2181

The same logic with rowwise():

exp_data %>%
    rowwise() %>%
    filter(any(abs(Size_Change - vec_data) %in% 0:1))
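
Both variants above evaluate one row at a time. As a loop-free sketch of the same idea, base R's outer() can compute every pairwise difference in one step. The data below is a trimmed copy of the question's exp_data (only Name and Size_Change), and the names diffs and keep are illustrative, not part of the answer above.

```r
# Trimmed copy of the question's data: only the columns needed here.
exp_data <- data.frame(
  Name = c("Mark", "Greg", "Tomas", "Morka", "Pekka",
           "Robert", "Tim", "Tom", "Bobby", "Terka"),
  Size_Change = c(4004, 2786, 3753, 1840, 1966, 1966, 1966, 2181, 3655, 1978)
)
vec_data <- c(4003, 2785, 954, 1129, 4013, 756, 1852, 2424, 1954, 246, 147,
              234, 562, 1617, 2180, 888, 1176)

# diffs[i, j] = Size_Change[i] - vec_data[j]
diffs <- outer(exp_data$Size_Change, vec_data, "-")

# Keep a row if at least one difference is within the tolerance (here 1).
keep <- rowSums(abs(diffs) <= 1) > 0
exp_data[keep, ]
```

This builds a matrix with one row per data frame row and one column per vec_data value, so it trades memory for speed; for very long vectors the row-wise versions may be the safer choice.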

CodePudding user response：

The most obvious approach is to filter on an inequality rather than on exact matches (always recommended when comparing numeric, non-integer values):

comp <- function(x, yvec, tolerance = 1){
  sapply(x, \(xi){any(abs(xi - yvec) <= tolerance)})
}
exp_data[comp(exp_data$Size_Change, vec_data),]
     Name Mode           Change     Size Place Size_Change
2049 Mark    1 C[ 58], K[ 2665] 1335.261     3        4004
2050 Greg    2 C[ 58], K[ 1206]  697.356     4        2786
2056  Tom    1  C[ 58], C[ 600]  727.696     3        2181
# Tolerance = 2
# exp_data[comp(exp_data$Size_Change, vec_data, 2),]
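
The question also asks for a column showing how each row was matched. A minimal sketch of one way to do that, assuming it is enough to report only the closest vec_data value and its signed difference: the helper nearest_offset() and the column names Matched and Offset are hypothetical, and the data is a trimmed copy of the question's exp_data.

```r
# Trimmed copy of the question's data: only the columns needed here.
exp_data <- data.frame(
  Name = c("Mark", "Greg", "Tomas", "Morka", "Pekka",
           "Robert", "Tim", "Tom", "Bobby", "Terka"),
  Size_Change = c(4004, 2786, 3753, 1840, 1966, 1966, 1966, 2181, 3655, 1978)
)
vec_data <- c(4003, 2785, 954, 1129, 4013, 756, 1852, 2424, 1954, 246, 147,
              234, 562, 1617, 2180, 888, 1176)

# Hypothetical helper: for each x, find the closest value in yvec.
nearest_offset <- function(x, yvec, tolerance = 1) {
  diffs <- outer(x, yvec, "-")            # signed differences x[i] - yvec[j]
  idx <- apply(abs(diffs), 1, which.min)  # column of the closest yvec value
  off <- diffs[cbind(seq_along(x), idx)]  # signed offset to that value
  data.frame(Matched = yvec[idx], Offset = off, Keep = abs(off) <= tolerance)
}

m <- nearest_offset(exp_data$Size_Change, vec_data, tolerance = 24)

# Attach the match information and keep only rows within the tolerance.
result <- cbind(exp_data, m[c("Matched", "Offset")])[m$Keep, ]
result
```

An Offset of 0 would mean an exact match. When two vec_data values are equally close, which.min() picks the first one, so this sketch would need adjusting if ties matter.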

CodePudding user response：

What about using a tolerance function?

tol <- \(x, tol=1L) sapply(seq(-tol, tol, 1L), \(i) sweep(as.matrix(x), 1L, i))

exp_data[exp_data$Size_Change %in% tol(vec_data), ]

#      Name Mode           Change     Size Place Size_Change
# 2049 Mark    1 C[ 58], K[ 2665] 1335.261     3        4004
# 2050 Greg    2 C[ 58], K[ 1206]  697.356     4        2786
# 2056  Tom    1  C[ 58], C[ 600]  727.696     3        2181

It defaults to a tolerance of ±1; if we want ±24, we can pass it as an argument:

exp_data[exp_data$Size_Change %in% tol(vec_data, 24L), ]
#        Name Mode               Change     Size Place Size_Change
# 2049   Mark    1     C[ 58], K[ 2665] 1335.261     3        4004
# 2050   Greg    2     C[ 58], K[ 1206]  697.356     4        2786
# 2052  Morka    4      C[ 58], K[ 259]  920.430     2        1840
# 2053  Pekka   NA      C[ 58], T[ 385]  492.236     4        1966
# 2054 Robert    3      C[ 58], T[ 385]  393.991     5        1966
# 2055    Tim   NA      C[ 58], T[ 385]  492.239     4        1966
# 2056    Tom    1      C[ 58], C[ 600]  727.696     3        2181
# 2058  Terka    3 D[ 58], I[ 12][ 385]  495.237     4        1978

If you are wondering about the L in 24L, it is integer notation; you may also use tol=24 without any problems.
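
Note that a tolerance of ±24 accepts every difference from -24 to 24, not only the specific ones listed in the question. If only those should count, a sketch along the following lines might work. It assumes the permitted shifts are 0, ±12 and ±24, each optionally combined with ±1, which is one reading of the question rather than something it states outright. The names shifts, allowed, diffs and keep are illustrative, and the data is a trimmed copy of exp_data.

```r
# Trimmed copy of the question's data: only the columns needed here.
exp_data <- data.frame(
  Name = c("Mark", "Greg", "Tomas", "Morka", "Pekka",
           "Robert", "Tim", "Tom", "Bobby", "Terka"),
  Size_Change = c(4004, 2786, 3753, 1840, 1966, 1966, 1966, 2181, 3655, 1978)
)
vec_data <- c(4003, 2785, 954, 1129, 4013, 756, 1852, 2424, 1954, 246, 147,
              234, 562, 1617, 2180, 888, 1176)

# Assumed base shifts, each combined with -1, 0 or +1.
shifts <- c(-24, -12, 0, 12, 24)
allowed <- sort(as.vector(outer(shifts, -1:1, "+")))

# diffs[i, j] = Size_Change[i] - vec_data[j]
diffs <- outer(exp_data$Size_Change, vec_data, "-")

# %in% drops the matrix shape, so restore it before counting per row.
hits <- matrix(diffs %in% allowed, nrow = nrow(diffs))
keep <- rowSums(hits) > 0
exp_data[keep, ]
```

The exact comparison with %in% relies on the values being whole numbers, as they are in the example data; with fractional values an inequality test would be the safer choice.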

Note: R version 4.1.2 (2021-11-01)