I have two data.frames A
(dat1
) and B
(dat2
).
Is it possible to find small differences (up to tol
) between one or more numeric columns (cols
) across A
and B
and the replace those in A
with the corresponding ones in B
For example, if you look at the numeric columns across A
and B
, you'll see THIS
in A
for column upper.CL
is 1.770 but the same in B
is 1.771 i.e., they are different by tol = .001
. In this case, all we need is to replace 1.770 in A
with 1.771 from B
so that all numeric columns in A
and B
are the same.
Is it possible to write an R
function to find & use all numeric columns that differ by tol
and replace the values as described above?
foo <- function(dat1, dat2, cols = NULL, tol){
# Solution
}
# EXAMPLE OF USE:
#### foo(dat1 = A, dat2 = B, cols = upper.CL, tol = .002)
# OR
#### foo(dat1 = A, dat2 = B, tol = .002)
A = read.table(h=TRUE, text="
task_dif time emmean SE lower.CL upper.CL
1 complex 1 1.733 0.023 1.686 1.779
2 simple 1 1.734 0.018 1.697 1.770# <- THIS
3 complex 2 1.702 0.025 1.652 1.751
4 simple 2 1.714 0.017 1.680 1.747
5 complex 3 1.757 0.019 1.720 1.794
6 simple 3 1.740 0.027 1.687 1.794
7 complex 4 1.773 0.019 1.735 1.810
8 simple 4 1.764 0.025 1.713 1.814")
B = read.table(h=TRUE, text="
order time emmean SE lower.CL upper.CL
1 c2s 1 1.733 0.023 1.686 1.779
2 s2c 1 1.734 0.018 1.697 1.771# <- THIS
3 c2s 2 1.714 0.017 1.680 1.747
4 s2c 2 1.702 0.025 1.652 1.751
5 c2s 3 1.757 0.019 1.720 1.794
6 s2c 3 1.740 0.027 1.687 1.794
7 c2s 4 1.764 0.025 1.713 1.814
8 s2c 4 1.773 0.019 1.735 1.810")
Desired output:
A = read.table(h=TRUE, text="
task_dif time emmean SE lower.CL upper.CL
1 complex 1 1.733 0.023 1.686 1.779
2 simple 1 1.734 0.018 1.697 1.771# <- Replaced using corresponding value in `B`
3 complex 2 1.702 0.025 1.652 1.751
4 simple 2 1.714 0.017 1.680 1.747
5 complex 3 1.757 0.019 1.720 1.794
6 simple 3 1.740 0.027 1.687 1.794
7 complex 4 1.773 0.019 1.735 1.810
8 simple 4 1.764 0.025 1.713 1.814")
B = read.table(h=TRUE, text="
order time emmean SE lower.CL upper.CL
1 c2s 1 1.733 0.023 1.686 1.779
2 s2c 1 1.734 0.018 1.697 1.771# <- THIS
3 c2s 2 1.714 0.017 1.680 1.747
4 s2c 2 1.702 0.025 1.652 1.751
5 c2s 3 1.757 0.019 1.720 1.794
6 s2c 3 1.740 0.027 1.687 1.794
7 c2s 4 1.764 0.025 1.713 1.814
8 s2c 4 1.773 0.019 1.735 1.810")
CodePudding user response:
Try this:
library(dplyr)
close <- function(tol) function(a, b) abs(a - b) < tol
mutate(A, rn = row_number()) %>%
fuzzyjoin::fuzzy_left_join(
select(B, upper.CL),
by = "upper.CL",
match_fun = list(close(0.001))) %>%
distinct(rn, upper.CL.x, upper.CL.y, .keep_all = TRUE) %>%
select(-upper.CL.x, upper.CL = upper.CL.y)
# task_dif time emmean SE lower.CL rn upper.CL
# 1 complex 1 1.733 0.023 1.686 1 1.779
# 2 simple 1 1.734 0.018 1.697 2 1.771
# 3 complex 2 1.702 0.025 1.652 3 1.751
# 4 simple 2 1.714 0.017 1.680 4 1.747
# 5 complex 3 1.757 0.019 1.720 5 1.794
# 6 simple 3 1.740 0.027 1.687 6 1.794
# 7 complex 4 1.773 0.019 1.735 7 1.810
# 8 simple 4 1.764 0.025 1.713 8 1.814
Notes:
I add
rn
in the likely condition of a 1-to-many join; if you look at the data beforedistinct(..)
above, you'll see that rows 5 and 6 are repeated, which makes sense since1.794
occurs twice each inA
andB
(though twice in one alone is sufficient for 1-to-many). Your real data might introduce more duplicate rows thatdistinct
do not address, you can usern
to summarize/aggregate or reconstruct as needed.Joining on a floating-point number is not guaranteed; while it should likely work well enough with this data, the issue is never an error: failures due to floating-point equality concerns will merely evidence as "did not join", which is not an error nor even a warning. The
tol=
should address most of that, but caveat emptor.It seems likely that this join might also need more join-columns. For those, within
fuzzy_left_join
one would include`==`
as the match function. For instance, iftime
were also a join on equality (just work with me here), thenmutate(A, rn = row_number()) %>% fuzzyjoin::fuzzy_left_join( select(B, time, upper.CL), by = c("time", "upper.CL"), match_fun = list(`==`, close(0.001))) %>% ...
(If not clear, those are backticks, not single-quotes.)
my
close
is a function that returns a function ... that may seem too meta, but it works well in a more generalized fashion here.match_fun
must be a function (either anonymous or named) or a~
-function (rlang-style). I think this looks more readable than the equalalentmatch_fun=list(~ abs(.x - .y) < 0.001)
, though that works as well.
In the end, I think this function might meet your needs.
close <- function(tol) function(a, b) abs(a - b) < tol
myjoin <- function(X, Y, by = NULL, tol = 0,
type = c("full", "left", "right"), reduce = TRUE) {
if (is.null(names(by))) names(by) <- by
stopifnot(
all(names(by) %in% names(X)),
all(by %in% names(Y)),
all(!is.na(tol) & tol >= 0)
)
type <- match.arg(type)
if (length(tol) == 1L) tol <- rep(tol, length(by))
funs <- lapply(tol, function(z) if (tol < 1e-15) `==` else close(tol))
joinfun <- switch(
type,
full = fuzzyjoin::fuzzy_full_join,
left = fuzzyjoin::fuzzy_left_join,
right = fuzzyjoin::fuzzy_right_join)
out <- joinfun(
transform(X, rn = seq_len(nrow(X))),
subset(Y, select = by),
by = by, match_fun = funs)
if (reduce) {
samename <- (names(by) == by)
byx <- paste0(names(by), ifelse(samename, ".x", ""))
byy <- paste0(by, ifelse(samename, ".y", ""))
out <- out[!duplicated(out[, unique(c("rn", byx, byy))]),]
rownames(out) <- NULL
# ASSUMING you always want to replace the LHS 'by' columns with the
# RHS columns ...
out[names(by)] <- out[byy]
out[c(byx[samename], byy)] <- NULL
}
out
}
myjoin(A, B, by = "upper.CL", tol = 0.001, type = "left")
# task_dif time emmean SE lower.CL rn upper.CL
# 1 complex 1 1.733 0.023 1.686 1 1.779
# 2 simple 1 1.734 0.018 1.697 2 1.771
# 3 complex 2 1.702 0.025 1.652 3 1.751
# 4 simple 2 1.714 0.017 1.680 4 1.747
# 5 complex 3 1.757 0.019 1.720 5 1.794
# 6 simple 3 1.740 0.027 1.687 6 1.794
# 7 complex 4 1.773 0.019 1.735 7 1.810
# 8 simple 4 1.764 0.025 1.713 8 1.814
myjoin(A, B, by = "upper.CL", tol = 0.001, type = "left", reduce = FALSE)
# task_dif time emmean SE lower.CL upper.CL.x rn upper.CL.y
# 1 complex 1 1.733 0.023 1.686 1.779 1 1.779
# 2 simple 1 1.734 0.018 1.697 1.770 2 1.771
# 3 complex 2 1.702 0.025 1.652 1.751 3 1.751
# 4 simple 2 1.714 0.017 1.680 1.747 4 1.747
# 5 complex 3 1.757 0.019 1.720 1.794 5 1.794
# 6 complex 3 1.757 0.019 1.720 1.794 5 1.794
# 7 simple 3 1.740 0.027 1.687 1.794 6 1.794
# 8 simple 3 1.740 0.027 1.687 1.794 6 1.794
# 9 complex 4 1.773 0.019 1.735 1.810 7 1.810
# 10 simple 4 1.764 0.025 1.713 1.814 8 1.814
CodePudding user response:
this might not be the fastest since equal values also get reassigned but i think its the most straight forward implementation.
foo <- function(dat1,dat2,tol) {
## Filter numerics
O<-lapply(list(dat1,dat2),\(x) Filter(is.numeric,x))
# flag differences based on tol
ERR<-(abs(O[[1]]-O[[2]])<=tol)
# reassign
dat2[names(O[[2]])][ERR] <- O[[1]][ERR]
dat2
}
foo(A,B,.001)
CodePudding user response:
If order and the size are the same, you can simply cbind(A,B[,6])
CodePudding user response:
Hope this solves your problem:
foo <- function(A, B, cols = NULL, tol) {
if (!is.null(cols)) {
C <- abs(A[cols]-B[cols])
idx <- C > tol
idx
for (i in 1:nrow(A)) {
if (idx[i]) {
A[i,cols] <- B[i,cols]
}
}
return(A)
}
num <- unlist(lapply(A, is.numeric), use.names = FALSE)
nn <- unlist(lapply(A, function(x) !is.numeric(x)), use.names = FALSE)
A_num <- A[,num]
B_num <- B[,num]
A_nn <- A[,nn]
C <- abs(A_num-B_num)
idx <- C > tol
for (i in 1:nrow(A_num)) {
for (j in 1:ncol(A_num)) {
if (idx[i,j]) {
A_num[i,j] <- B_num[i,j]
}
}
}
A_new <- cbind(A_nn, A_num)
A_new
}