Fast way to split comma-separated strings into (sparse) boolean matrix in R-CodePudding

I have a dataframe in the style of

id choice
----------
1  "(a,b,c)"
2  "(c)"
3  "(a,c)"
4  "(d)"

i.e.

DF = data.frame(id=c(1,2,3,4), choice=c("(a,b,c))","(c)","(a,c)","(d)"))

and I would like to obtain a dataframe with boolean columns

id a b c d
----------
1  T T T F
2  F F T F
3  F F T F
4  F F F T

Based on Split comma-separated strings into boolean columns I tried using

library(splitstackshape)
cSplit_e(DF, "choice", sep = "[,()]", mode = "binary", 
         type = "character", fill = 0, drop = TRUE, fixed = FALSE)

however, this runs out of memory with my actual dataframe (I have ca. 10 million rows & ca. 1.5 million unique choice values (number not known in advance)) & in general is also quite slow.

I was therefore wondering if there is perhaps a faster & more memory efficient way to do this, e.g. using data.table or dplyr or multidplyr and maybe also storing the boolean matrix in a sparse format?

EDIT: in my actual dataset choice is possible amino acid mutations, so instead of a,b,c, ... e.g.

(NSP5_P132H,Spike_H69del,Spike_T95I,Spike_A67V,Spike_N969K,Spike_H655Y,Spike_N856K,N_R203K,Spike_G142D,NSP3_A1892T,Spike_Q954H,N_P13L,NSP3_L1266I,N_R32del,M_Q19E,NSP4_T492I,NSP6_L105del,Spike_N679K,Spike_N764K,Spike_L212I,NSP6_G107del,NSP6_I189V,Spike_T547K,M_D3G,Spike_D796Y,N_G204R,Spike_V143del,M_A63T,Spike_K417N)

and I don't have an a priori list of possible mutations occurring in the dataset - so I don't know my choices a priori (there is ca. 1.5 million unique ones)...

EDIT2: the data.table solution below in principle would have worked, were it not for the fact that I would end up with a data table with 1.5 million columns and 10 million rows, which doesn't fit in memory anymore on my machine. I guess storing it as a sparse matrix would really be necessary in the end?

CodePudding user response：

You can try dcast strsplit like below

dcast(
  setDT(DF)[
    ,
    .(choice = Filter(nzchar, unlist(strsplit(choice, "\\W ")))),
    id
  ],
  id ~ choice,
  fun.aggregate = function(x) length(x) > 0
)

which gives

   id     a     b     c     d
1:  1  TRUE  TRUE  TRUE FALSE
2:  2 FALSE FALSE  TRUE FALSE
3:  3  TRUE FALSE  TRUE FALSE
4:  4 FALSE FALSE FALSE  TRUE

CodePudding user response：

Here is an example with data.table:

library(data.table)
DF = data.table(id=c(1,2,3,4), choice=c("(a,b,c))","(c)","(a,c)","(d)"))
DF$choice <- gsub("\\(|\\)", "", DF$choice)
choices <- unique(do.call('c', strsplit(DF$choice, ",", fixed = TRUE)))

DF[, `colnames<-`(as.data.frame(t(seq_along(choices) %in% as.numeric(
  factor(strsplit(choice, ",", fixed = TRUE)[[1]], choices)
))), choices), by = "id"]
#>    id     a     b     c     d
#> 1:  1  TRUE  TRUE  TRUE FALSE
#> 2:  2 FALSE FALSE  TRUE FALSE
#> 3:  3  TRUE FALSE  TRUE FALSE
#> 4:  4 FALSE FALSE FALSE  TRUE

CodePudding user response：

Using fastDummies

library(fastDumies)
library(dplyr)
library(stringr)
DF %>%
  mutate(choice = str_remove_all(choice, "\\(|\\)")) %>% 
  dummy_cols("choice", split = ",") %>%
  transmute(id, across(starts_with('choice'), as.logical, 
     .names = "{str_remove(.col, 'choice_')}")) %>%
  select(-choice)

-output

  id     a     b     c     d
1  1  TRUE  TRUE  TRUE FALSE
2  2 FALSE FALSE  TRUE FALSE
3  3  TRUE FALSE  TRUE FALSE
4  4 FALSE FALSE FALSE  TRUE

With base R, strsplit table should be fast

lst1 <- strsplit(gsub("[()]", "", DF$choice), ",")
table(rep(seq_along(lst1), lengths(lst1)), unlist(lst1)) > 0
   
        a     b     c     d
  1  TRUE  TRUE  TRUE FALSE
  2 FALSE FALSE  TRUE FALSE
  3  TRUE FALSE  TRUE FALSE
  4 FALSE FALSE FALSE  TRUE

CodePudding user response：

Try this

library(tidyverse)
x %>%
  mutate(choice = str_extract_all(choice, "\\w")) %>%
  unnest(choice) %>%
  pivot_wider(
    names_from = choice,
    values_fill = FALSE,
    values_from = choice,
    values_fn = ~ !is.na(.)
  )

assuming x is your dataset (i.e. DF)