I need to match rowwise, character values in a column with names of multiple columns that are not exact matches but partially contain those characters, and get in a new column the corresponding values from columns where strings match. My data:
Name_A Name_B Name_C Name_D PartName
5.1 3.5 1.4 0.2 A
4.9 3.0 1.4 0.2 A
4.7 3.2 1.3 0.2 C
4.6 3.1 1.5 0.2 D
5.0 3.6 1.4 0.2 B
5.4 3.9 1.7 0.4 C
Desired output:
Name_A Name_B Name_C Name_D PartName New_Col
5.1 3.5 1.4 0.2 A 5.1
4.9 3.0 1.4 0.2 A 4.9
4.7 3.2 1.3 0.2 C 1.3
4.6 3.1 1.5 0.2 D 0.2
5.0 3.6 1.4 0.2 B 3.6
5.4 3.9 1.7 0.4 C 1.7
Attempted code:
df %>%
mutate(New_Col = purrr::map2_dbl(row_number(),~df[Name_A,Name_B,Name_C, Name_D]))
Help appreciated!
CodePudding user response:
match
the partial name to the full names with the Name_
prefix removed, and then grab the corresponding value for each row using matrix indexing (row column index in a 2-column matrix):
df$new_col <- df[-ncol(df)][cbind(
seq_len(nrow(df)),
match(df$PartName, sub("^Name_", "", names(df)[-ncol(df)]))
)]
df
# Name_A Name_B Name_C Name_D PartName new_col
#1 5.1 3.5 1.4 0.2 A 5.1
#2 4.9 3.0 1.4 0.2 A 4.9
#3 4.7 3.2 1.3 0.2 C 1.3
#4 4.6 3.1 1.5 0.2 D 0.2
#5 5.0 3.6 1.4 0.2 B 3.6
#6 5.4 3.9 1.7 0.4 C 1.7
CodePudding user response:
Another option in base R is split
-unsplit
:
data$New_Col <- unsplit(Map(`[`,
data[paste0("Name_", LETTERS[1:4])],
split(seq_len(nrow(data)), data$PartName)),
data$PartName)
It scales better than indexing the data frame with a matrix of the form cbind(i, j)
. The latter approach has significant overhead due to an intermediate coercion of the data frame to matrix, which involves a deep copy of all of the variables.
If you do go with split
-unsplit
, then make sure that PartName
is a factor with suitable levels
, as you need the second and third arguments of Map
to correspond elementwise. In this case, it would be good practice to start with:
data$PartName <- factor(data$PartName, levels = LETTERS[1:4])
For the curious:
set.seed(1L)
n <- 1e 06L
r <- 25L
x <- as.data.frame(replicate(r, rnorm(n), simplify = FALSE))
names(x) <- paste0("Name_", LETTERS[1:r])
x$PartName <- LETTERS[1:r][sample.int(r, n, TRUE)]
library("data.table")
setDTthreads(4L)
y <- as.data.table(x)
f1 <- function(x) {
n <- length(x)
i <- seq_len(nrow(x))
j <- match(x$PartName, sub("^Name_", "", names(x)[-n]))
x[-n][cbind(i, j)]
}
f2 <- function(x) {
nms <- names(x)[-length(x)]
g <- factor(x$PartName, levels = sub("^Name_", "", nms))
unsplit(Map(`[`, x[nms], split(seq_len(nrow(x)), g)), g)
}
f3 <- function(x) {
x[, New_Col := .SD[[paste0("Name_", .BY[[1L]])]], by = PartName]
}
bench::mark(f1(x), f2(x), f3(y), iterations = 100L, check = FALSE, filter_gc = FALSE)
## # A tibble: 3 × 13
## expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
## <bch:expr> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
## 1 f1(x) 86.1ms 92.3ms 10.9 225.1MB 6.95 100 64 9.21s <NULL> <Rprofmem> <bench_tm> <tibble>
## 2 f2(x) 43.4ms 45.8ms 21.2 61.1MB 3.60 100 17 4.73s <NULL> <Rprofmem> <bench_tm> <tibble>
## 3 f3(y) 77.9ms 79.7ms 12.4 21.1MB 0.247 100 2 8.08s <NULL> <Rprofmem> <bench_tm> <tibble>