Home > Back-end >  R match rowwise values with column names in multiple columns and get column value
R match rowwise values with column names in multiple columns and get column value

Time:02-23

I need to match rowwise, character values in a column with names of multiple columns that are not exact matches but partially contain those characters, and get in a new column the corresponding values from columns where strings match. My data:

Name_A Name_B Name_C Name_D PartName
5.1    3.5    1.4    0.2    A
4.9    3.0    1.4    0.2    A
4.7    3.2    1.3    0.2    C
4.6    3.1    1.5    0.2    D
5.0    3.6    1.4    0.2    B
5.4    3.9    1.7    0.4    C

Desired output:

Name_A Name_B Name_C Name_D PartName New_Col
5.1    3.5    1.4    0.2    A        5.1
4.9    3.0    1.4    0.2    A        4.9
4.7    3.2    1.3    0.2    C        1.3
4.6    3.1    1.5    0.2    D        0.2
5.0    3.6    1.4    0.2    B        3.6
5.4    3.9    1.7    0.4    C        1.7

Attempted code:

df %>%
  mutate(New_Col = purrr::map2_dbl(row_number(),~df[Name_A,Name_B,Name_C, Name_D]))

Help appreciated!

CodePudding user response:

match the partial name to the full names with the Name_ prefix removed, and then grab the corresponding value for each row using matrix indexing (row column index in a 2-column matrix):

df$new_col <- df[-ncol(df)][cbind(
  seq_len(nrow(df)),
  match(df$PartName, sub("^Name_", "", names(df)[-ncol(df)]))
)]

df
#  Name_A Name_B Name_C Name_D PartName new_col
#1    5.1    3.5    1.4    0.2        A     5.1
#2    4.9    3.0    1.4    0.2        A     4.9
#3    4.7    3.2    1.3    0.2        C     1.3
#4    4.6    3.1    1.5    0.2        D     0.2
#5    5.0    3.6    1.4    0.2        B     3.6
#6    5.4    3.9    1.7    0.4        C     1.7

CodePudding user response:

Another option in base R is split-unsplit:

data$New_Col <- unsplit(Map(`[`, 
                            data[paste0("Name_", LETTERS[1:4])],
                            split(seq_len(nrow(data)), data$PartName)),
                        data$PartName)

It scales better than indexing the data frame with a matrix of the form cbind(i, j). The latter approach has significant overhead due to an intermediate coercion of the data frame to matrix, which involves a deep copy of all of the variables.

If you do go with split-unsplit, then make sure that PartName is a factor with suitable levels, as you need the second and third arguments of Map to correspond elementwise. In this case, it would be good practice to start with:

data$PartName <- factor(data$PartName, levels = LETTERS[1:4])

For the curious:

set.seed(1L)
n <- 1e 06L
r <- 25L
x <- as.data.frame(replicate(r, rnorm(n), simplify = FALSE))
names(x) <- paste0("Name_", LETTERS[1:r])
x$PartName <- LETTERS[1:r][sample.int(r, n, TRUE)]

library("data.table")
setDTthreads(4L)
y <- as.data.table(x)

f1 <- function(x) {
    n <- length(x)
    i <- seq_len(nrow(x))
    j <- match(x$PartName, sub("^Name_", "", names(x)[-n]))
    x[-n][cbind(i, j)]
}
f2 <- function(x) {
    nms <- names(x)[-length(x)]
    g <- factor(x$PartName, levels = sub("^Name_", "", nms))
    unsplit(Map(`[`, x[nms], split(seq_len(nrow(x)), g)), g)
}
f3 <- function(x) {
    x[, New_Col := .SD[[paste0("Name_", .BY[[1L]])]], by = PartName]
}

bench::mark(f1(x), f2(x), f3(y), iterations = 100L, check = FALSE, filter_gc = FALSE)
## # A tibble: 3 × 13
##   expression      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
##   <bch:expr> <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
## 1 f1(x)        86.1ms  92.3ms      10.9   225.1MB    6.95    100    64      9.21s <NULL> <Rprofmem> <bench_tm> <tibble>
## 2 f2(x)        43.4ms  45.8ms      21.2    61.1MB    3.60    100    17      4.73s <NULL> <Rprofmem> <bench_tm> <tibble>
## 3 f3(y)        77.9ms  79.7ms      12.4    21.1MB    0.247   100     2      8.08s <NULL> <Rprofmem> <bench_tm> <tibble>
  • Related