I have a text column whose digits I want to separate into columns:
df <- structure(list(text = c("((i: DH=0, SZ=0, RM=0, FO=0, GP=0, BC=0))",
"((i: DH=0, SZ=0, RM=0, FO=0, GP=0, BC=0))", "((i: DH=1, SZ=0, RM=0, FO=NA, GP=0, BC=0))",
"((i: DH=1, SZ=0, RM=NA, FO=0, GP=0, BC=0))", "((i: DH=1, SZ=1, RM=0, FO=0, GP=0, BC=0))",
"((i: DH=0, SZ=1, RM=0, FO=0, GP=0, BC=0))", "((i: DH=0, SZ=0, RM=1, FO=0, GP=0, BC=1))",
"((i: DH=0, SZ=1, RM=0, FO=0, GP=0, BC=0))", "((i: DH=0, SZ=1, RM=0, FO=0, GP=0, BC=0))",
"((i: DH=0, SZ=1, RM=0, FO=1, GP=0, BC=0))", "((i: DH=0, SZ=1, RM=0, FO=0, GP=0, BC=1))",
"((i: DH=0, SZ=NA, RM=0, FO=0, GP=0, BC=0))", "((i: DH=0, SZ=0, RM=0, FO=0, GP=0, BC=0))"
)), row.names = c(NA, -13L), class = "data.frame")
I'm able to str_extract
the digits but unable to separate the digits into numeric columns using tidyr
's function separate
:
library(stringr)
library(tidyr)
library(dplyr)
df %>%
mutate(text = str_extract_all(text,"(?<==)\\d|NA(?=,)")) %>%
separate(col = text,
into = c("DH", "SZ", "RM", "FO", "GP", "BC"),
sep = ", ",
convert = TRUE)
DH SZ RM FO GP BC
1 c("0" "0" "0" "0" "0" "0")
2 c("0" "0" "0" "0" "0" "0")
3 c("1" "0" "0" "NA" "0" "0")
4 c("1" "0" "NA" "0" "0" "0")
5 c("1" "1" "0" "0" "0" "0")
6 c("0" "1" "0" "0" "0" "0")
7 c("0" "0" "1" "0" "0" "1")
8 c("0" "1" "0" "0" "0" "0")
9 c("0" "1" "0" "0" "0" "0")
10 c("0" "1" "0" "1" "0" "0")
11 c("0" "1" "0" "0" "0" "1")
12 c("0" "NA" "0" "0" "0" "0")
13 c("0" "0" "0" "0" "0" "0")
My hunch is that this is because str_extract_all
returns the digits in a list and separate
does not work on lists. What's the solution here?
CodePudding user response:
This is the purpose of the simplify=T
argument, as it coerces the output into a matrix.
df %>%
mutate(text = str_extract_all(text,"(?<==)\\d|NA(?=,)", simplify=T))
text.1 text.2 text.3 text.4 text.5 text.6
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 1 0 0 NA 0 0
4 1 0 NA 0 0 0
5 1 1 0 0 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 1
8 0 1 0 0 0 0
9 0 1 0 0 0 0
10 0 1 0 1 0 0
11 0 1 0 0 0 1
12 0 NA 0 0 0 0
13 0 0 0 0 0 0
CodePudding user response:
You can use unnest_wider
instead of separate
and then do a bit of additional tidyverse magic. Note: you'll get a warning because NAs are introduced by coercion, but that is expected in your case.
df %>%
mutate(text = str_extract_all(text,"(?<==)\\d|NA(?=,)")) %>%
unnest_wider(text) %>%
mutate(across(starts_with("..."), ~as.numeric(.))) %>%
rename_with(.cols = starts_with("..."),
.fn = ~c("DH", "SZ", "RM", "FO", "GP", "BC"))
which gives:
# A tibble: 13 x 6
DH SZ RM FO GP BC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 1 0 0 NA 0 0
4 1 0 NA 0 0 0
5 1 1 0 0 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 1
8 0 1 0 0 0 0
9 0 1 0 0 0 0
10 0 1 0 1 0 0
11 0 1 0 0 0 1
12 0 NA 0 0 0 0
13 0 0 0 0 0 0
An alternative (in terms of renaming the columns) would be to set the names attribute directly in the list which solves all the renaming issues:
library(tidyverse)
df %>%
mutate(text = str_extract_all(text,"(?<==)\\d|NA(?=,)"),
text = map(text, setNames, c("DH", "SZ", "RM", "FO", "GP", "BC"))) %>%
unnest_wider(text) %>%
mutate(across(c("DH", "SZ", "RM", "FO", "GP", "BC"), ~as.numeric(.)))