Efficient row-wise string concatenation with NA omission


I have a data.table with the sample structure below:

dt = data.table(
  V1 = c('One', 'Two', 'Three'),
  V2 = c('Red', NA, NA),
  V3 = c('Cat', NA, 'Dogs')
)

> dt
      V1   V2   V3
1:   One  Red  Cat
2:   Two <NA> <NA>
3: Three <NA> Dogs

I would like to perform a row-wise concatenation of the elements into a new column, omitting NAs:

> dt
      V1   V2   V3          V4
1:   One  Red  Cat One Red Cat
2:   Two <NA> <NA>         Two
3: Three <NA> Dogs  Three Dogs

In this trivial example I could, of course, transpose the object and paste each transposed column, along the lines of the sketch below; transposing, however, is too computationally expensive. I would also rather not paste the NAs (coerced to "NA" strings) and then have to strip them out in a second step. In short, I would welcome any high-performance solution to this.
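For concreteness, the transpose approach I have in mind is roughly this (using data.table::transpose; materialising the transpose is the expensive step):

dt[, V4 := sapply(
  transpose(.SD),                           # one list element per original row
  \(x) paste(x[!is.na(x)], collapse = " ")  # drop NAs, then paste
)]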

CodePudding user response:

Use paste inside do.call, with replace to blank out the NAs, then gsub to collapse double spaces to single ones and trimws to remove the leading/trailing spaces.

dt[, V4 := trimws(gsub('  ', ' ', do.call(paste, replace(.SD, is.na(.SD), '')), fixed=TRUE))]
#       V1   V2   V3          V4
# 1:   One  Red  Cat One Red Cat
# 2:   Two <NA> <NA>         Two
# 3: Three <NA> Dogs  Three Dogs
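One caveat for wider tables: gsub replaces non-overlapping matches in a single left-to-right pass, so with four or more columns, two adjacent interior NAs produce three spaces that the fixed '  ' -> ' ' substitution collapses only to two. A regex that collapses any run of spaces, as a drop-in replacement for the line above, should be safer (a sketch, not benchmarked):

dt[, V4 := trimws(gsub(' +', ' ', do.call(paste, replace(.SD, is.na(.SD), ''))))]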

CodePudding user response:

Edited post so that all three expressions return identical results (microbenchmark's check = "identical" requires it).

Both of the answers above make sense. I had considered one using apply() as well, but which one is more efficient depends on the size of the data. I've seen this pattern before: the "tidy" solution is slower on small datasets but comparatively much faster on large ones. Here are the benchmarks for the three-observation dataset:

library(microbenchmark)
library(data.table)
library(dplyr)
microbenchmark(
   "apply" = {
     dt = data.table(
       V1 = c('One', 'Two', 'Three'),
       V2 = c('Red', NA, NA),
       V3 = c('Cat', NA, 'Dogs'))
     dt$V4 <- apply(dt, 1, \(x) paste(x[!is.na(x)], collapse = " "))
     dt
     }, 
   "dt" = {
     dt = data.table(
       V1 = c('One', 'Two', 'Three'),
       V2 = c('Red', NA, NA),
       V3 = c('Cat', NA, 'Dogs'))
     dt[, V4 := trimws(gsub('  ', ' ', do.call(paste, replace(.SD, is.na(.SD), '')), fixed=TRUE))]
     dt
   }, 
   "dplyr" = {
     dt = data.table(
       V1 = c('One', 'Two', 'Three'),
       V2 = c('Red', NA, NA),
       V3 = c('Cat', NA, 'Dogs')) 
     dt <- dt %>%
       mutate(across(everything(), ~case_when(is.na(.x) ~ "", TRUE ~ .x))) %>%
       tidyr::unite(V4, dplyr::everything(), sep = " ", remove = FALSE) %>%
       mutate(V4 = trimws(gsub("  ", " ", V4)),
              across(-V4, ~case_when(.x == "" ~ NA_character_,
                                     TRUE ~ .x))) %>%
       dplyr::select(V1:V3, V4)
     dt
     }, check="identical")
#> Unit: microseconds
#>   expr      min       lq      mean    median        uq       max neval cld
#>  apply  190.749  217.762  300.7254  240.4665  255.3140  6411.430   100 a  
#>     dt  596.156  632.276  699.1835  671.5225  707.8065  3191.156   100  b 
#>  dplyr 7682.044 7950.626 8611.9900 8299.9260 8703.1520 14222.238   100   c

Above, the tidy solution is slowest and the apply() solution is fastest. Next, I replicated each row of the data 10,000 times, making dt a 30,000-row dataset. Here are the benchmarks over 10 iterations:

microbenchmark(
  "apply" = {
    dt = data.table(
      V1 = c('One', 'Two', 'Three'),
      V2 = c('Red', NA, NA),
      V3 = c('Cat', NA, 'Dogs'))
    dt <- dt[rep(1:3, each=10000), ]
    dt$V4 <- apply(dt, 1, \(x) paste(x[!is.na(x)], collapse = " "))
    dt
  }, 
  "dt" = {
    dt = data.table(
      V1 = c('One', 'Two', 'Three'),
      V2 = c('Red', NA, NA),
      V3 = c('Cat', NA, 'Dogs'))
    dt <- dt[rep(1:3, each=10000), ]
    dt[, V4 := trimws(gsub('  ', ' ', do.call(paste, replace(.SD, is.na(.SD), '')), fixed=TRUE))]
    dt
  }, 
  "dplyr" = {
    dt = data.table(
      V1 = c('One', 'Two', 'Three'),
      V2 = c('Red', NA, NA),
      V3 = c('Cat', NA, 'Dogs')) 
    dt <- dt[rep(1:3, each=10000), ]
    dt <- dt %>%
      tidyr::unite(V4, dplyr::everything(), sep = " ", remove = FALSE) %>%
      mutate(V4 = trimws(gsub("\\s*NA\\s*", " ", V4))) %>%
      dplyr::select(V1:V3, V4)
    dt
  }, 
  check = "identical", 
  times=10)
#> Unit: milliseconds
#>   expr       min       lq      mean    median        uq       max neval cld
#>  apply 106.95376 108.5313 116.07324 114.84741 117.57105 141.48294    10   c
#>     dt  18.81052  18.8378  19.30309  19.31539  19.61948  20.08436    10 a  
#>  dplyr  40.08105  40.8789  41.23341  41.32001  41.53007  42.56531    10  b

Created on 2022-10-03 by the reprex package (v2.0.1)

Above, the dt solution is fastest: faster than the tidy solution by a factor of about 2, and faster than the apply() solution by a factor of around 6.
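Finally, if the gsub()/trimws() cleanup in the dt solution ever becomes the bottleneck, a cleanup-free variant is possible: fold the columns together left to right and only insert a separator before non-NA values. A minimal, unbenchmarked sketch (concat_na_omit is a made-up helper, not from any package):

# Fold columns left to right: keep the accumulator where x is NA;
# otherwise append x, with a space unless the accumulator is still empty.
concat_na_omit <- function(cols) {
  Reduce(
    function(acc, x) fifelse(is.na(x), acc,
                             fifelse(acc == "", x, paste(acc, x))),
    cols,
    init = rep("", length(cols[[1]]))
  )
}
dt[, V4 := concat_na_omit(.SD), .SDcols = V1:V3]

Whether this beats the do.call(paste, ...) version would need its own benchmark; it simply avoids the two extra passes over the pasted strings.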
