I have a dataframe in the style of
id choice
----------
1 "(a,b,c)"
2 "(c)"
3 "(a,c)"
4 "(d)"
i.e.
DF = data.frame(id=c(1,2,3,4), choice=c("(a,b,c))","(c)","(a,c)","(d)"))
and I would like to obtain a dataframe with boolean columns
id a b c d
----------
1 T T T F
2 F F T F
3 F F T F
4 F F F T
Based on Split comma-separated strings into boolean columns I tried using
library(splitstackshape)
cSplit_e(DF, "choice", sep = "[,()]", mode = "binary",
type = "character", fill = 0, drop = TRUE, fixed = FALSE)
however, this runs out of memory with my actual dataframe (I have ca. 10 million rows & ca. 1.5 million unique choice values (number not known in advance)) & in general is also quite slow.
I was therefore wondering if there is perhaps a faster & more memory efficient way to do this, e.g. using data.table
or dplyr
or multidplyr
and maybe also storing the boolean matrix in a sparse format?
EDIT: in my actual dataset choice
is possible amino acid mutations, so instead of a,b,c, ...
e.g.
(NSP5_P132H,Spike_H69del,Spike_T95I,Spike_A67V,Spike_N969K,Spike_H655Y,Spike_N856K,N_R203K,Spike_G142D,NSP3_A1892T,Spike_Q954H,N_P13L,NSP3_L1266I,N_R32del,M_Q19E,NSP4_T492I,NSP6_L105del,Spike_N679K,Spike_N764K,Spike_L212I,NSP6_G107del,NSP6_I189V,Spike_T547K,M_D3G,Spike_D796Y,N_G204R,Spike_V143del,M_A63T,Spike_K417N)
and I don't have an a priori list of possible mutations occurring in the dataset - so I don't know my choices a priori (there is ca. 1.5 million unique ones)...
EDIT2: the data.table solution below in principle would have worked, were it not for the fact that I would end up with a data table with 1.5 million columns and 10 million rows, which doesn't fit in memory anymore on my machine. I guess storing it as a sparse matrix would really be necessary in the end?
CodePudding user response:
You can try dcast
strsplit
like below
dcast(
setDT(DF)[
,
.(choice = Filter(nzchar, unlist(strsplit(choice, "\\W ")))),
id
],
id ~ choice,
fun.aggregate = function(x) length(x) > 0
)
which gives
id a b c d
1: 1 TRUE TRUE TRUE FALSE
2: 2 FALSE FALSE TRUE FALSE
3: 3 TRUE FALSE TRUE FALSE
4: 4 FALSE FALSE FALSE TRUE
CodePudding user response:
Here is an example with data.table:
library(data.table)
DF = data.table(id=c(1,2,3,4), choice=c("(a,b,c))","(c)","(a,c)","(d)"))
DF$choice <- gsub("\\(|\\)", "", DF$choice)
choices <- unique(do.call('c', strsplit(DF$choice, ",", fixed = TRUE)))
DF[, `colnames<-`(as.data.frame(t(seq_along(choices) %in% as.numeric(
factor(strsplit(choice, ",", fixed = TRUE)[[1]], choices)
))), choices), by = "id"]
#> id a b c d
#> 1: 1 TRUE TRUE TRUE FALSE
#> 2: 2 FALSE FALSE TRUE FALSE
#> 3: 3 TRUE FALSE TRUE FALSE
#> 4: 4 FALSE FALSE FALSE TRUE
CodePudding user response:
Using fastDummies
library(fastDumies)
library(dplyr)
library(stringr)
DF %>%
mutate(choice = str_remove_all(choice, "\\(|\\)")) %>%
dummy_cols("choice", split = ",") %>%
transmute(id, across(starts_with('choice'), as.logical,
.names = "{str_remove(.col, 'choice_')}")) %>%
select(-choice)
-output
id a b c d
1 1 TRUE TRUE TRUE FALSE
2 2 FALSE FALSE TRUE FALSE
3 3 TRUE FALSE TRUE FALSE
4 4 FALSE FALSE FALSE TRUE
With base R
, strsplit
table
should be fast
lst1 <- strsplit(gsub("[()]", "", DF$choice), ",")
table(rep(seq_along(lst1), lengths(lst1)), unlist(lst1)) > 0
a b c d
1 TRUE TRUE TRUE FALSE
2 FALSE FALSE TRUE FALSE
3 TRUE FALSE TRUE FALSE
4 FALSE FALSE FALSE TRUE
CodePudding user response:
Try this
library(tidyverse)
x %>%
mutate(choice = str_extract_all(choice, "\\w")) %>%
unnest(choice) %>%
pivot_wider(
names_from = choice,
values_fill = FALSE,
values_from = choice,
values_fn = ~ !is.na(.)
)
assuming x
is your dataset (i.e. DF
)