Home > Software engineering >  Replace values of one dataframe from the corresponding values from another dataframe in R
Replace values of one dataframe from the corresponding values from another dataframe in R

Time:02-03

I have two data.frames A (dat1) and B (dat2).

Is it possible to find small differences (up to tol) between one or more numeric columns (cols) across A and B and the replace those in A with the corresponding ones in B

For example, if you look at the numeric columns across A and B, you'll see THIS in A for column upper.CL is 1.770 but the same in B is 1.771 i.e., they are different by tol = .001. In this case, all we need is to replace 1.770 in A with 1.771 from B so that all numeric columns in A and B are the same.

Is it possible to write an R function to find & use all numeric columns that differ by tol and replace the values as described above?

foo <- function(dat1, dat2, cols = NULL, tol){

# Solution
} 

# EXAMPLE OF USE:
#### foo(dat1 = A, dat2 = B, cols = upper.CL, tol = .002)
# OR
#### foo(dat1 = A, dat2 = B, tol = .002)
A = read.table(h=TRUE, text="
  task_dif time emmean    SE lower.CL upper.CL
1  complex    1  1.733 0.023 1.686    1.779
2   simple    1  1.734 0.018 1.697    1.770# <- THIS
3  complex    2  1.702 0.025 1.652    1.751
4   simple    2  1.714 0.017 1.680    1.747
5  complex    3  1.757 0.019 1.720    1.794
6   simple    3  1.740 0.027 1.687    1.794
7  complex    4  1.773 0.019 1.735    1.810
8   simple    4  1.764 0.025 1.713    1.814")

B = read.table(h=TRUE, text="
  order time emmean    SE lower.CL upper.CL
1   c2s    1  1.733 0.023 1.686    1.779
2   s2c    1  1.734 0.018 1.697    1.771# <- THIS
3   c2s    2  1.714 0.017 1.680    1.747
4   s2c    2  1.702 0.025 1.652    1.751
5   c2s    3  1.757 0.019 1.720    1.794
6   s2c    3  1.740 0.027 1.687    1.794
7   c2s    4  1.764 0.025 1.713    1.814
8   s2c    4  1.773 0.019 1.735    1.810")

Desired output:

A = read.table(h=TRUE, text="
  task_dif time emmean    SE lower.CL upper.CL
1  complex    1  1.733 0.023 1.686    1.779
2   simple    1  1.734 0.018 1.697    1.771# <- Replaced using corresponding value in `B`
3  complex    2  1.702 0.025 1.652    1.751
4   simple    2  1.714 0.017 1.680    1.747
5  complex    3  1.757 0.019 1.720    1.794
6   simple    3  1.740 0.027 1.687    1.794
7  complex    4  1.773 0.019 1.735    1.810
8   simple    4  1.764 0.025 1.713    1.814")

B = read.table(h=TRUE, text="
  order time emmean    SE lower.CL upper.CL
1   c2s    1  1.733 0.023 1.686    1.779
2   s2c    1  1.734 0.018 1.697    1.771# <- THIS
3   c2s    2  1.714 0.017 1.680    1.747
4   s2c    2  1.702 0.025 1.652    1.751
5   c2s    3  1.757 0.019 1.720    1.794
6   s2c    3  1.740 0.027 1.687    1.794
7   c2s    4  1.764 0.025 1.713    1.814
8   s2c    4  1.773 0.019 1.735    1.810")

CodePudding user response:

Try this:

library(dplyr)
close <- function(tol) function(a, b) abs(a - b) < tol
mutate(A, rn = row_number()) %>%
  fuzzyjoin::fuzzy_left_join(
    select(B, upper.CL), 
    by = "upper.CL", 
    match_fun = list(close(0.001))) %>%
  distinct(rn, upper.CL.x, upper.CL.y, .keep_all = TRUE) %>%
  select(-upper.CL.x, upper.CL = upper.CL.y)
#   task_dif time emmean    SE lower.CL rn upper.CL
# 1  complex    1  1.733 0.023    1.686  1    1.779
# 2   simple    1  1.734 0.018    1.697  2    1.771
# 3  complex    2  1.702 0.025    1.652  3    1.751
# 4   simple    2  1.714 0.017    1.680  4    1.747
# 5  complex    3  1.757 0.019    1.720  5    1.794
# 6   simple    3  1.740 0.027    1.687  6    1.794
# 7  complex    4  1.773 0.019    1.735  7    1.810
# 8   simple    4  1.764 0.025    1.713  8    1.814

Notes:

  • I add rn in the likely condition of a 1-to-many join; if you look at the data before distinct(..) above, you'll see that rows 5 and 6 are repeated, which makes sense since 1.794 occurs twice each in A and B (though twice in one alone is sufficient for 1-to-many). Your real data might introduce more duplicate rows that distinct do not address, you can use rn to summarize/aggregate or reconstruct as needed.

  • Joining on a floating-point number is not guaranteed; while it should likely work well enough with this data, the issue is never an error: failures due to floating-point equality concerns will merely evidence as "did not join", which is not an error nor even a warning. The tol= should address most of that, but caveat emptor.

  • It seems likely that this join might also need more join-columns. For those, within fuzzy_left_join one would include `==` as the match function. For instance, if time were also a join on equality (just work with me here), then

    mutate(A, rn = row_number()) %>%
      fuzzyjoin::fuzzy_left_join(
        select(B, time, upper.CL),
        by = c("time", "upper.CL"),
        match_fun = list(`==`, close(0.001))) %>%
      ...
    

    (If not clear, those are backticks, not single-quotes.)

  • my close is a function that returns a function ... that may seem too meta, but it works well in a more generalized fashion here. match_fun must be a function (either anonymous or named) or a ~-function (rlang-style). I think this looks more readable than the equalalent match_fun=list(~ abs(.x - .y) < 0.001), though that works as well.


In the end, I think this function might meet your needs.

close <- function(tol) function(a, b) abs(a - b) < tol
myjoin <- function(X, Y, by = NULL, tol = 0,
                   type = c("full", "left", "right"), reduce = TRUE) {
  if (is.null(names(by))) names(by) <- by
  stopifnot(
    all(names(by) %in% names(X)),
    all(by %in% names(Y)),
    all(!is.na(tol) & tol >= 0)
  )
  type <- match.arg(type)
  if (length(tol) == 1L) tol <- rep(tol, length(by))
  funs <- lapply(tol, function(z) if (tol < 1e-15) `==` else close(tol))
  joinfun <- switch(
    type,
    full = fuzzyjoin::fuzzy_full_join,
    left = fuzzyjoin::fuzzy_left_join,
    right = fuzzyjoin::fuzzy_right_join)
  out <- joinfun(
    transform(X, rn = seq_len(nrow(X))),
    subset(Y, select = by),
    by = by, match_fun = funs)
  if (reduce) {
    samename <- (names(by) == by)
    byx <- paste0(names(by), ifelse(samename, ".x", ""))
    byy <- paste0(by, ifelse(samename, ".y", ""))
    out <- out[!duplicated(out[, unique(c("rn", byx, byy))]),]
    rownames(out) <- NULL
    # ASSUMING you always want to replace the LHS 'by' columns with the
    # RHS columns ...
    out[names(by)] <- out[byy]
    out[c(byx[samename], byy)] <- NULL
  }
  out
}

myjoin(A, B, by = "upper.CL", tol = 0.001, type = "left")
#   task_dif time emmean    SE lower.CL rn upper.CL
# 1  complex    1  1.733 0.023    1.686  1    1.779
# 2   simple    1  1.734 0.018    1.697  2    1.771
# 3  complex    2  1.702 0.025    1.652  3    1.751
# 4   simple    2  1.714 0.017    1.680  4    1.747
# 5  complex    3  1.757 0.019    1.720  5    1.794
# 6   simple    3  1.740 0.027    1.687  6    1.794
# 7  complex    4  1.773 0.019    1.735  7    1.810
# 8   simple    4  1.764 0.025    1.713  8    1.814
myjoin(A, B, by = "upper.CL", tol = 0.001, type = "left", reduce = FALSE)
#    task_dif time emmean    SE lower.CL upper.CL.x rn upper.CL.y
# 1   complex    1  1.733 0.023    1.686      1.779  1      1.779
# 2    simple    1  1.734 0.018    1.697      1.770  2      1.771
# 3   complex    2  1.702 0.025    1.652      1.751  3      1.751
# 4    simple    2  1.714 0.017    1.680      1.747  4      1.747
# 5   complex    3  1.757 0.019    1.720      1.794  5      1.794
# 6   complex    3  1.757 0.019    1.720      1.794  5      1.794
# 7    simple    3  1.740 0.027    1.687      1.794  6      1.794
# 8    simple    3  1.740 0.027    1.687      1.794  6      1.794
# 9   complex    4  1.773 0.019    1.735      1.810  7      1.810
# 10   simple    4  1.764 0.025    1.713      1.814  8      1.814

CodePudding user response:

this might not be the fastest since equal values also get reassigned but i think its the most straight forward implementation.

foo <- function(dat1,dat2,tol) {
  ## Filter numerics
  O<-lapply(list(dat1,dat2),\(x) Filter(is.numeric,x)) 
  # flag differences based on tol
  ERR<-(abs(O[[1]]-O[[2]])<=tol) 
  # reassign
  dat2[names(O[[2]])][ERR]  <- O[[1]][ERR]
  dat2
}

foo(A,B,.001) 

CodePudding user response:

If order and the size are the same, you can simply cbind(A,B[,6])

CodePudding user response:

Hope this solves your problem:

foo <- function(A, B, cols = NULL, tol) {
  
  if (!is.null(cols)) {
    C <- abs(A[cols]-B[cols])
    idx <- C > tol
    idx
    
    for (i in 1:nrow(A)) {
      if (idx[i]) {
        A[i,cols] <- B[i,cols]
      }
    }
    
    return(A)
  }
  
  num <- unlist(lapply(A, is.numeric), use.names = FALSE)  
  nn <- unlist(lapply(A, function(x) !is.numeric(x)), use.names = FALSE)  
  
  A_num <- A[,num]
  B_num <- B[,num]
  
  A_nn <- A[,nn]
  
  C <- abs(A_num-B_num)
  
  idx <- C > tol
  
  for (i in 1:nrow(A_num)) {
    for (j in 1:ncol(A_num)) {
      if (idx[i,j]) {
        A_num[i,j] <- B_num[i,j]
      }
    }
  }
  
  A_new <- cbind(A_nn, A_num)
  A_new
}
  • Related