Home > database >  Substitute specific rows based on another dataframe in R
Substitute specific rows based on another dataframe in R

Time:06-30

Lets say I have:

# Create a, b, c, d variables
x1 <- c("g", "a","c","d","e","f","h", "b")
x2 <- c(1,1,1,1,1,1,1,1)
x7 <- c(10,10,10,10,10,10, 10, 10)

# Join the variables to create a data frame
dataframeA <- data.frame(x1, x2, x7)


# Create a, b, c, d variables
x3 <- c("z",  "k" ,"a", "b","c","d")
x4 <- c(5, 19, 6,7,8,9)


# Join the variables to create a data frame
 dataframeB <- data.frame(x3, x4)

And I want to substitute values on column x2 of dataframe A with values of column x4 on dataframe b based on matching of a vector, such as dataframe A be:

matchingids = c("a", "b")

dataframeA$x2[which(dataframeA$x1 %in% matchingids)] <- dataframeB$x4[which(dataframeB$x3 %in% matchingids)]

dataframeA turns to:

structure(list(x1 = c("g", "a", "c", "d", "e", "f", "h", "b"), 
    x2 = c(1, 6, 1, 1, 1, 1, 1, 7), x7 = c(10, 10, 10, 10, 10, 
    10, 10, 10)), row.names = c(NA, -8L), class = "data.frame")

Which works, but then,

# Create a, b, c, d variables
x1 <- c("g", "a","c","d","e","f","h", "b")
x2 <- c(1,1,1,1,1,1,1,1)
x7 <- c(10,10,10,10,10,10, 10, 10)

# Join the variables to create a data frame
dataframeA <- data.frame(x1, x2, x7)

(here i changed "b" and "a" order
# Create a, b, c, d variables
x3 <- c("z",  "k" ,"b", "a","c","d")
x4 <- c(5, 19, 6,7,8,9)


# Join the variables to create a data frame
 dataframeB <- data.frame(x3, x4)



matchingids = c("a", "b")

dataframeA$x2[which(dataframeA$x1 %in% matchingids)] <- dataframeB$x4[which(dataframeB$x3 %in% matchingids)]

which gives:

structure(list(x1 = c("g", "a", "c", "d", "e", "f", "h", "b"), 
    x2 = c(1, 6, 1, 1, 1, 1, 1, 7), x7 = c(10, 10, 10, 10, 10, 
    10, 10, 10)), row.names = c(NA, -8L), class = "data.frame")

Which does not work, because it is substituing a on first dataframe to b in the second dataframe (order of the objects is wrong) In the second case, a is being change to b in the first dataframe (value should be a = 7, b = 6) As you can observe, i get the same result even if I change the position of "a" in dataframeB

CodePudding user response:

This seems like a merge/join operation.

### base R
merge(dataframeA, subset(dataframeB, x3 %in% matchingids),
      by.x="x1", by.y="x3", all.x=TRUE) |>
  transform(x2 = ifelse(is.na(x4), x2, x4)) |>
  subset(select = -x4)
#   x1 x2 x7
# 1  a  6 10
# 2  b  7 10
# 3  g  1 10
# 4  c  1 10
# 5  d  1 10
# 6  e  1 10
# 7  f  1 10
# 8  h  1 10

### dplyr
library(dplyr)
filter(dataframeB, x3 %in% matchingids) %>%
  right_join(dataframeA, by = c("x3"="x1")) %>%
  mutate(x2 = coalesce(x4, x2)) %>%
  select(-x4)
#   x3 x2 x7
# 1  a  6 10
# 2  b  7 10
# 3  g  1 10
# 4  c  1 10
# 5  d  1 10
# 6  e  1 10
# 7  f  1 10
# 8  h  1 10

(FYI, base::merge doesn't do a good job preserving the original order. If it is very important, I suggest you preface that code with adding a row-number field, then sorting post-merge on that field. Adding sort=FALSE to base::merge does not solve it for me.)

Similarly with the second sets of frames:

merge(dataframeA, subset(dataframeB, x3 %in% matchingids),
      by.x="x1", by.y="x3", all.x=TRUE) |>
  transform(x2 = ifelse(is.na(x4), x2, x4)) |>
  subset(select = -x4)
#   x1 x2 x7
# 1  a  7 10
# 2  b  6 10
# 3  g  1 10
# 4  c  1 10
# 5  d  1 10
# 6  e  1 10
# 7  f  1 10
# 8  h  1 10

filter(dataframeB, x3 %in% matchingids) %>%
  right_join(dataframeA, by = c("x3"="x1")) %>%
  mutate(x2 = coalesce(x4, x2)) %>%
  select(-x4)
#   x3 x2 x7
# 1  b  6 10
# 2  a  7 10
# 3  g  1 10
# 4  c  1 10
# 5  d  1 10
# 6  e  1 10
# 7  f  1 10
# 8  h  1 10

Note: the |> is in R-4 and later. If you're on an earlier version, you'll need to shift to use intermediate objects.

For more discussions about the concepts of merge/join, see: How to join (merge) data frames (inner, outer, left, right), What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?, (pandas) Pandas Merging 101. It's a very powerful process and can pay huge dividends once you become more comfortable with using it.

CodePudding user response:

It works with:

a$x2[order(a$x1)][which(a$x1[order(a$x1)] %in% matchingids)] <-  b$x4[order(b$x3)][which(b$x3[order(b$x3)] %in% matchingids)]

But there might be plobems with it, mainly when

 matchingids

have IDs that don't match dataframeA or dataframeB, or neither. If the number of IDs are different from dataframe to dataframe, it will also not work

Might only work when dataframeA and dataframeB contains all

matchingids
  • Related