I have a data frame:
a <- c(0, 0, 1, 3, NA, 0, 0, NA)
b <- c(NA, 0, 1, 3, 3, NA, 6, 4)
c <- c(0, NA, 1, 1, 5, 0, NA, 0)
d <- c(4, 0, 0, 2, 3, NA, 1, 4)
e <- c(NA, NA, 0, 0, 6, 1, 1, 0)
f <- c(0, NA, 0, 0, 3, 5, 1, 4)
df <- data.frame(a,b,c,d,e,f)
df
a b c d e f
1 0 NA 0 4 NA 0
2 0 0 NA 0 NA NA
3 1 1 1 0 0 0
4 3 3 1 2 0 0
5 NA 3 5 3 6 3
6 0 NA 0 NA 1 5
7 0 6 NA 1 1 1
8 NA 4 0 4 0 4
I want to create new variables as follows:
df %>%
mutate(new_var1 = a (b/7),
new_var2 = c (d/7),
new_var3 = e (f/7))
Within pairs, if the value in column a, c, or e is not NA, but the value in column b, d, or f is NA, then I would like R to return the value in a, c, or e.
If the value in column a, c, or e is NA, but the value in column b, d, or f is not, then I would like R to return b/7, d/7, or f/7.
Since I want to do this across multiple pairs of columns in the data frame, it would also be helpful to know a more efficient way to iterate through.
Any advice would be greatly appreciated!
CodePudding user response:
In base R:
reshape(df, matrix(seq(ncol(df)),2), dir='long')|>
transform(new_var = ifelse(is.na(a), b/7,
ifelse(is.na(b), a, a (b/7))))|>
reshape(dir='wide', drop = c('a','b'), sep='')
id new_var1 new_var2 new_var3
1.1 1 0.0000000 0.5714286 0.0000000
2.1 2 0.0000000 0.0000000 NA
3.1 3 1.1428571 1.0000000 0.0000000
4.1 4 3.4285714 1.2857143 0.0000000
5.1 5 0.4285714 5.4285714 6.4285714
6.1 6 0.0000000 0.0000000 1.7142857
7.1 7 0.8571429 0.1428571 1.1428571
8.1 8 0.5714286 0.5714286 0.5714286
CodePudding user response:
You show a logic that considers the first and adjacent column to the right, and then moves two to the right. You could set up a function f
with the replacement instructions, for efficiency avoiding ifelse
in favor of subsetting.
f <- \(x) {
y <- x 1
s0 <- !is.na(df[, x]) & !is.na(df[, y])
s1 <- !is.na(df[, x]) & is.na(df[, y])
s2 <- is.na(df[, x]) & !is.na(df[, y])
r <- rep.int(NA_real_, nrow(df))
r[s0] <- df[s0, x] df[s0, y]/7
r[s1] <- df[s1, x]
r[s2] <- df[s2, y]/7
r
}
Then just use the f
unction in an sapply
with a seq
uence [1, 3, 5, ..., ncol(df) - 1].
r <- sapply(seq.int(1, ncol(df) - 1, 2), f) |>
`colnames<-`(paste0('nv', seq_len(ncol(df)/2)))
r
# nv1 nv2 nv3
# # [1,] 0.0000000 0.5714286 0.0000000
# # [2,] 0.0000000 0.0000000 NA
# # [3,] 1.1428571 1.0000000 0.0000000
# # [4,] 3.4285714 1.2857143 0.0000000
# # [5,] 0.4285714 5.4285714 6.4285714
# # [6,] 0.0000000 0.0000000 1.7142857
# # [7,] 0.8571429 0.1428571 1.1428571
# # [8,] 0.5714286 0.5714286 0.5714286
Finally cbind
the result to your data frame.
cbind(df, r)
This appears to be five times faster than using reshape/ifelse
.
Data:
df <- structure(list(a = c(0, 0, 1, 3, NA, 0, 0, NA), b = c(NA, 0,
1, 3, 3, NA, 6, 4), c = c(0, NA, 1, 1, 5, 0, NA, 0), d = c(4,
0, 0, 2, 3, NA, 1, 4), e = c(NA, NA, 0, 0, 6, 1, 1, 0), f = c(0,
NA, 0, 0, 3, 5, 1, 4)), class = "data.frame", row.names = c(NA,
-8L))