Basically, I have a dataframe that has some data that need to be replaced with data in a different dataframe. Some of the observations need to be changed in only a few of many columns.
To illustrate, say I have a dataframe:
df1 <- data.frame(index = c('a', 'b', 'c', 'd', 'e'),
var1 = c(40, 22, 12, 4, 0),
var2 = c(75, 55, 65, 15, 0),
var3 = c(9, 18, 81, 3, 0),
var4 = c(1, 11, 21, 61, 0),
var5 = c(-1, -2, -3, -4, 0),
var6 = c(0, 1, 0, 1, 0))
index var1 var2 var3 var4 var5 var6
1 a 40 75 9 1 -1 0
2 b 22 55 18 11 -2 1
3 c 12 65 81 21 -3 0
4 d 4 15 3 61 -4 1
5. e 0 0 0 0 0 0
Then, some better data for observations of "b" and "c" come along, but only for variables "var2" and "var3":
df2 <- data.frame(index = c('a', 'b', 'c', 'd'),
var1 = c(40, 22, 12, 4),
var2 = c(75, 550, 650, 15),
var3 = c(9, 180, 810, 3),
var4 = c(1, 11, 21, 61))
index var1 var2 var3 var4
1 a 40 75 9 1
2 b 22 550 180 11
3 c 12 650 810 21
4 d 4 15 3 61
I want the resulting data frame:
index var1 var2 var3 var4 var5 var6
1 a 40 75 9 1 -1 0
2 b 22 550 180 11 -2 1
3 c 12 650 810 21 -3 0
4 d 4 15 3 61 -4 1
5 e 0 0 0 0 0 0
Any tips (particularly any tidyverse solutions to offfer?) I have played around with mutate
, across
, and replace
but can't seem to get it.
Clarification: I'd like the new values for var2 and var3 to be replaced with those from df2 wherever the index values are found in df1. Everything else should stay the same. The dataset I am working on has the index values show up many times. I tried the following, but something is wrong with the syntax:
df1 %>%
mutate(across(c(var2, var3),
~if_else(index %in% df2$index),
.[df2$index],
.))
CodePudding user response:
Update II: after new clarification: removed the previous answer:
library(dplyr)
bind_rows(df1, df2) %>%
arrange(index) %>%
mutate(across(var1:var4, ~ifelse(index=="b" |
index == "c", lead(.), .))) %>%
na.omit()
index var1 var2 var3 var4 var5 var6
1 a 40 75 9 1 -1 0
3 b 22 550 180 11 -2 1
5 c 12 650 810 21 -3 0
7 d 4 15 3 61 -4 1
9 e 0 0 0 0 0 0
CodePudding user response:
Based on the data you gave in the OP, and the fact that the 'better' observations are larger numbers, the following seems to give the desired result:
library(dplyr)
df1 <- data.frame(index = c('a', 'b', 'c', 'd', 'e'),
var1 = c(40, 22, 12, 4, 0),
var2 = c(75, 55, 65, 15, 0),
var3 = c(9, 18, 81, 3, 0),
var4 = c(1, 11, 21, 61, 0),
var5 = c(-1, -2, -3, -4, 0),
var6 = c(0, 1, 0, 1, 0))
df2 <- data.frame(index = c('a', 'b', 'c', 'd'),
var1 = c(40, 22, 12, 4),
var2 = c(75, 550, 650, 15),
var3 = c(9, 180, 810, 3),
var4 = c(1, 11, 21, 61))
get_best <- function(col){
value <- max(unlist(col), na.rm=TRUE)
return(value)
}
a <- bind_rows(df1,df2) %>%
group_by(index) %>%
mutate(across(.cols=everything(),
.fns=get_best)) %>%
unique()
Output:
# A tibble: 5 × 7
# Groups: index [5]
index var1 var2 var3 var4 var5 var6
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 40 75 9 1 -1 0
2 b 22 550 180 11 -2 1
3 c 12 650 810 21 -3 0
4 d 4 15 3 61 -4 1
5 e 0 0 0 0 0 0