Working with large dataframe with a string that is a list of elements (as chr). I want to separate the string so that each element has its own column with key - value. I tried 'tidyr::seperate' and 'tidyverse::unnest_wider()' but none of them returned my desired output.
Here is a dummy data :
df1 <- tibble(
id = c('000914', '000916'),
code = c('NN', 'SS'),
values2 = c("{DS=15}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}" , "{DS=0}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}"
) )
# A tibble: 2 x 3
id code values2
<chr> <chr> <chr>
1 000914 NN {DS=15}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}
2 000916 SS {DS=0}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}
I tried with separate, this is not wrong but it is not exactly what I am looking for and it needs several pivot_longer and pivot_wider for reshaping. Is there any better and faster alternative?
df1 %>%
separate(values2, into = paste("Col", 1:14))
# A tibble: 2 x 16
id code `Col 1` `Col 2` `Col 3` `Col 4` `Col 5` `Col 6` `Col 7` `Col 8` `Col 9`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 000914 NN "" DS 15 FPLUC 0 N CELL R NINT1
2 000916 SS "" DS 0 FPLUC 0 N CELL R NINT1
# ... with 5 more variables: Col 10 <chr>, Col 11 <chr>, Col 12 <chr>, Col 13 <chr>,
# Col 14 <chr>
Here is my desired output:
id code DS FPLUC N R S SPLUC
1 000914 NN 15 0 CELL NINT1 true 1
2 000916 SS 0 0 CELL NINT1 true 1
CodePudding user response:
A pure tidyr
solution:
library(tidyr)
df1 %>%
separate_rows(values2, sep = '(?<=\\})(?=\\{)') %>%
extract(values2, c('name', 'value'), '\\{(. ?)=(. ?)\\}') %>%
pivot_wider()
# # A tibble: 2 × 8
# id code DS FPLUC N R S SPLUC
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 000914 NN 15 0 CELL NINT1 true 1
# 2 000916 SS 0 0 CELL NINT1 true 1
separate_rows()
separates a collapsed column (values2
) into multiple rows. The separator(?<=\\})(?=\\{)
locates the position between}
and{
.extract()
separates a character column into multiple columns using regular expression groups. The regex\\{(. ?)=(. ?)\\}
searches the pattern{Col=Value}
and extractsCol
andValue
repectivly as new columns.
CodePudding user response:
It's messy but you may try
library(tidverse)
nms <- str_extract_all(df1$values2[1], "(?<=\\{). ?(?=\\=)", simplify = T)
nms <- c(names(df1)[-3],nms)
df1 %>%
mutate(values2 = str_extract_all(values2, "(?<=\\=). ?(?=\\})")) %>%
unnest_wider(values2, names_repair = ~nms)
id code DS FPLUC N R S SPLUC
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 000914 NN 15 0 CELL NINT1 true 1
2 000916 SS 0 0 CELL NINT1 true 1
CodePudding user response:
If you are not so keen on the REGEX, try the following
library(dplyr, quietly=TRUE, warn.conflicts=FALSE)
#> Warning: package 'dplyr' was built under R version 4.1.3
library(tidyr)
df1 <- tibble(
id = c('000914', '000916'),
code = c('NN', 'SS'),
values2 = c("{DS=15}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}" , "{DS=0}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}"
) )
df1
#> # A tibble: 2 x 3
#> id code values2
#> <chr> <chr> <chr>
#> 1 000914 NN {DS=15}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}
#> 2 000916 SS {DS=0}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}
df1 %>%
mutate(values2 = stringr::str_remove_all(values2, "\\}")) %>% # remove the } from values 2
separate(values2, into = c("X","DS","FPLUC","N","R","S","SPLUC"), sep = "{") %>% # split values 2 into required columns
mutate(across(.cols = c(DS, FPLUC, N, R, S, SPLUC),
.fns = ~stringr::str_remove(.x, "^. ="))) %>% #remove "xxx=" from each of the columns
select(!X) # keep all columns except X as it is empty
#> # A tibble: 2 x 8
#> id code DS FPLUC N R S SPLUC
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 000914 NN 15 0 CELL NINT1 true 1
#> 2 000916 SS 0 0 CELL NINT1 true 1