I have a data.table with the sample structure below:
dt = data.table(
  V1 = c('One', 'Two', 'Three'),
  V2 = c('Red', NA, NA),
  V3 = c('Cat', NA, 'Dogs')
)
> dt
      V1   V2   V3
1:   One  Red  Cat
2:   Two <NA> <NA>
3: Three <NA> Dogs
I would like to perform a row-wise concatenation of the elements into a new column which omits NAs:
> dt
      V1   V2   V3          V4
1:   One  Red  Cat One Red Cat
2:   Two <NA> <NA>         Two
3: Three <NA> Dogs  Three Dogs
In this trivial example, I could of course transpose the object and run lapply(.SD, \(x) paste(x[!is.na(x)], collapse = ' ')); however, transposing is too computationally expensive. I would also rather not have to strip out NAs that have been coerced to character in a second step. In short, I would welcome any high-performance solution to this.
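For completeness, a minimal sketch of the transpose-based approach I mean (using data.table::transpose; shown only to illustrate the row-wise idea I would like to avoid, not as a solution):
tdt <- data.table::transpose(dt)  # each column of tdt is one row of dt
dt$V4 <- sapply(tdt, \(x) paste(x[!is.na(x)], collapse = ' '))
This touches every row individually, which is exactly what scales poorly on large tables.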
CodePudding user response:
Using paste in do.call, replace to delete the NAs, gsub to turn the resulting double spaces into single ones, and trimws to drop leading/trailing spaces.
dt[, V4 := trimws(gsub('  ', ' ', do.call(paste, replace(.SD, is.na(.SD), '')), fixed=TRUE))]
#       V1   V2   V3          V4
# 1:   One  Red  Cat One Red Cat
# 2:   Two <NA> <NA>         Two
# 3: Three <NA> Dogs  Three Dogs
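Purely as illustration, the same pipeline spelled out step by step on the three columns (plain base R; the comments show the intermediate values):
cols  <- as.data.frame(dt[, .(V1, V2, V3)])
step1 <- replace(cols, is.na(cols), '')        # NAs become empty strings
step2 <- do.call(paste, step1)                 # "One Red Cat" "Two  " "Three  Dogs"
step3 <- gsub('  ', ' ', step2, fixed = TRUE)  # squeeze the doubled spaces
trimws(step3)                                  # "One Red Cat" "Two" "Three Dogs"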
CodePudding user response:
Edited post to make expressions identical
Both of the answers above make sense. I had considered one using apply() as well, but which one is more efficient depends on the size of the data. I've seen this pattern before: the "tidy" solution is slower on small datasets but much faster on large ones. Here are the benchmarks for the three-observation dataset:
library(microbenchmark)
library(data.table)
library(dplyr)
microbenchmark(
  "apply" = {
    dt = data.table(
      V1 = c('One', 'Two', 'Three'),
      V2 = c('Red', NA, NA),
      V3 = c('Cat', NA, 'Dogs'))
    dt$V4 <- apply(dt, 1, \(x) paste(x[!is.na(x)], collapse = " "))
    dt
  },
  "dt" = {
    dt = data.table(
      V1 = c('One', 'Two', 'Three'),
      V2 = c('Red', NA, NA),
      V3 = c('Cat', NA, 'Dogs'))
    dt[, V4 := trimws(gsub('  ', ' ', do.call(paste, replace(.SD, is.na(.SD), '')), fixed=TRUE))]
    dt
  },
  "dplyr" = {
    dt = data.table(
      V1 = c('One', 'Two', 'Three'),
      V2 = c('Red', NA, NA),
      V3 = c('Cat', NA, 'Dogs'))
    dt <- dt %>%
      mutate(across(everything(), ~case_when(is.na(.x) ~ "", TRUE ~ .x))) %>%
      tidyr::unite(V4, dplyr::everything(), sep = " ",
                   remove = FALSE) %>%
      mutate(V4 = trimws(gsub("  ", " ", V4)),
             across(-V4, ~case_when(.x == "" ~ NA_character_,
                                    TRUE ~ .x))) %>%
      dplyr::select(V1:V3, V4)
    dt
  }, check = "identical")
#> Unit: microseconds
#>   expr      min       lq      mean    median        uq       max neval cld
#>  apply  190.749  217.762  300.7254  240.4665  255.3140  6411.430   100   a
#>     dt  596.156  632.276  699.1835  671.5225  707.8065  3191.156   100   b
#>  dplyr 7682.044 7950.626 8611.9900 8299.9260 8703.1520 14222.238   100   c
Above, the tidy solution is slowest and the apply() solution is fastest. Next, I replicated each row in the data 10,000 times, making a 30,000-observation dataset dt. Here are the benchmarks on 10 iterations:
microbenchmark(
  "apply" = {
    dt = data.table(
      V1 = c('One', 'Two', 'Three'),
      V2 = c('Red', NA, NA),
      V3 = c('Cat', NA, 'Dogs'))
    dt <- dt[rep(1:3, each=10000), ]
    dt$V4 <- apply(dt, 1, \(x) paste(x[!is.na(x)], collapse = " "))
    dt
  },
  "dt" = {
    dt = data.table(
      V1 = c('One', 'Two', 'Three'),
      V2 = c('Red', NA, NA),
      V3 = c('Cat', NA, 'Dogs'))
    dt <- dt[rep(1:3, each=10000), ]
    dt[, V4 := trimws(gsub('  ', ' ', do.call(paste, replace(.SD, is.na(.SD), '')), fixed=TRUE))]
    dt
  },
  "dplyr" = {
    dt = data.table(
      V1 = c('One', 'Two', 'Three'),
      V2 = c('Red', NA, NA),
      V3 = c('Cat', NA, 'Dogs'))
    dt <- dt[rep(1:3, each=10000), ]
    dt <- dt %>%
      tidyr::unite(V4, dplyr::everything(), sep = " ",
                   remove = FALSE) %>%
      mutate(V4 = trimws(gsub("\\s*NA\\s*", " ", V4))) %>%
      dplyr::select(V1:V3, V4)
    dt
  },
  check = "identical",
  times = 10)
#> Unit: milliseconds
#>   expr       min       lq      mean    median        uq       max neval cld
#>  apply 106.95376 108.5313 116.07324 114.84741 117.57105 141.48294    10   c
#>     dt  18.81052  18.8378  19.30309  19.31539  19.61948  20.08436    10   a
#>  dplyr  40.08105  40.8789  41.23341  41.32001  41.53007  42.56531    10   b
Created on 2022-10-03 by the reprex package (v2.0.1)
Above, the dt solution is fastest, by a factor of about 2 over the tidy solution and around 6 over the apply() solution.
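For reference, those factors follow directly from the medians in the table above:
41.32001 / 19.31539   # dplyr vs dt: ~2.1
114.84741 / 19.31539  # apply vs dt: ~5.9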