Home > Enterprise >  Separate list extracted from text column into multiple numeric columns
Separate list extracted from text column into multiple numeric columns

Time:10-15

I have a text column whose digits I want to separate into columns:

df <- structure(list(text = c("((i: DH=0, SZ=0, RM=0, FO=0, GP=0, BC=0))", 
                              "((i: DH=0, SZ=0, RM=0, FO=0, GP=0, BC=0))", "((i: DH=1, SZ=0, RM=0, FO=NA, GP=0, BC=0))", 
                              "((i: DH=1, SZ=0, RM=NA, FO=0, GP=0, BC=0))", "((i: DH=1, SZ=1, RM=0, FO=0, GP=0, BC=0))", 
                              "((i: DH=0, SZ=1, RM=0, FO=0, GP=0, BC=0))", "((i: DH=0, SZ=0, RM=1, FO=0, GP=0, BC=1))", 
                              "((i: DH=0, SZ=1, RM=0, FO=0, GP=0, BC=0))", "((i: DH=0, SZ=1, RM=0, FO=0, GP=0, BC=0))", 
                              "((i: DH=0, SZ=1, RM=0, FO=1, GP=0, BC=0))", "((i: DH=0, SZ=1, RM=0, FO=0, GP=0, BC=1))", 
                              "((i: DH=0, SZ=NA, RM=0, FO=0, GP=0, BC=0))", "((i: DH=0, SZ=0, RM=0, FO=0, GP=0, BC=0))"
)), row.names = c(NA, -13L), class = "data.frame")

I'm able to str_extractthe digits but unable to separate the digits into numeric columns using tidyr's function separate:

library(stringr)
library(tidyr)
library(dplyr)
df %>%
  mutate(text = str_extract_all(text,"(?<==)\\d|NA(?=,)")) %>%
  separate(col = text,
           into = c("DH", "SZ", "RM", "FO", "GP", "BC"),
           sep = ", ",
           convert = TRUE)
      DH   SZ   RM   FO  GP   BC
1  c("0"  "0"  "0"  "0" "0" "0")
2  c("0"  "0"  "0"  "0" "0" "0")
3  c("1"  "0"  "0" "NA" "0" "0")
4  c("1"  "0" "NA"  "0" "0" "0")
5  c("1"  "1"  "0"  "0" "0" "0")
6  c("0"  "1"  "0"  "0" "0" "0")
7  c("0"  "0"  "1"  "0" "0" "1")
8  c("0"  "1"  "0"  "0" "0" "0")
9  c("0"  "1"  "0"  "0" "0" "0")
10 c("0"  "1"  "0"  "1" "0" "0")
11 c("0"  "1"  "0"  "0" "0" "1")
12 c("0" "NA"  "0"  "0" "0" "0")
13 c("0"  "0"  "0"  "0" "0" "0")

My hunch is that this is because str_extract_all returns the digits in a list and separate does not work on lists. What's the solution here?

CodePudding user response:

This is the purpose of the simplify=T argument, as it coerces the output into a matrix.

df %>%
  mutate(text = str_extract_all(text,"(?<==)\\d|NA(?=,)", simplify=T))
   text.1 text.2 text.3 text.4 text.5 text.6
1       0      0      0      0      0      0
2       0      0      0      0      0      0
3       1      0      0     NA      0      0
4       1      0     NA      0      0      0
5       1      1      0      0      0      0
6       0      1      0      0      0      0
7       0      0      1      0      0      1
8       0      1      0      0      0      0
9       0      1      0      0      0      0
10      0      1      0      1      0      0
11      0      1      0      0      0      1
12      0     NA      0      0      0      0
13      0      0      0      0      0      0

CodePudding user response:

You can use unnest_wider instead of separate and then do a bit of additional tidyverse magic. Note: you'll get a warning because NAs are introduced by coercion, but that is expected in your case.

df %>%
  mutate(text = str_extract_all(text,"(?<==)\\d|NA(?=,)")) %>%
  unnest_wider(text) %>%
  mutate(across(starts_with("..."), ~as.numeric(.))) %>%
  rename_with(.cols = starts_with("..."),
              .fn = ~c("DH", "SZ", "RM", "FO", "GP", "BC"))

which gives:

# A tibble: 13 x 6
      DH    SZ    RM    FO    GP    BC
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     0     0     0     0     0     0
 2     0     0     0     0     0     0
 3     1     0     0    NA     0     0
 4     1     0    NA     0     0     0
 5     1     1     0     0     0     0
 6     0     1     0     0     0     0
 7     0     0     1     0     0     1
 8     0     1     0     0     0     0
 9     0     1     0     0     0     0
10     0     1     0     1     0     0
11     0     1     0     0     0     1
12     0    NA     0     0     0     0
13     0     0     0     0     0     0

An alternative (in terms of renaming the columns) would be to set the names attribute directly in the list which solves all the renaming issues:

library(tidyverse)
df %>%
  mutate(text = str_extract_all(text,"(?<==)\\d|NA(?=,)"),
         text = map(text, setNames, c("DH", "SZ", "RM", "FO", "GP", "BC"))) %>%
  unnest_wider(text) %>%
  mutate(across(c("DH", "SZ", "RM", "FO", "GP", "BC"), ~as.numeric(.)))
  • Related