I am working with a data frame df with a column text that contains strings of the form "WORD" or "WORD, WORD2". More precisely, there may be some gibberish before or after these blocks, but I know how to take care of that.
I would like to use the tidyverse function extract to pull the two words into two columns, t1 and t2, such that
- the string "WORDS" gets extracted into "WORDS" and NA, and
- the string "WORDS, WORD2" gets extracted into "WORDS" and "WORD2".
I tried a command of the following form:
df |> extract(x, c("1", "2"), "([^[:punct:]]+),?[[:space:]]?([^[:punct:]]*)",
              remove = FALSE,
              convert = TRUE)
However, this always reads the first row into "WORD" and "" (an empty string) instead of NA. How can I modify my version to obtain the desired behaviour?
EDIT: Here is a possible data frame:
library(tidyverse)
df <- data.frame(x = c("123 WORD", "4564 WORD, TEST 1"))
# Expected output
df_out <- data.frame(x = c("123 WORD", "4564 WORD, TEST 1"),
                     t1 = c("WORD", "WORD"),
                     t2 = c(NA, "TEST"))
CodePudding user response:
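df %>%
  # Second word is optional; `$` anchors the match at the end of the string
  extract(x, c('t1', 't2'), '(\\w+)(?:, (\\w+).*)?$', remove = FALSE) %>%
  # An unmatched optional group may come back as '', so convert it to NA
  mutate(across(c(t1, t2), ~ na_if(.x, '')))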
                  x   t1   t2
1          123 WORD WORD <NA>
2 4564 WORD, TEST 1 WORD TEST
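To see what the pattern does on its own, here is a minimal sketch that runs the same regular expression through stringr's str_match() instead of extract(). The variable names m, t1 and t2 are just for illustration. The point is that the second word sits inside an optional non-capturing group, so when there is no ", WORD2" part that group does not take part in the match at all, rather than matching an empty string the way `([^[:punct:]]*)` does.

```r
library(stringr)

x <- c("123 WORD", "4564 WORD, TEST 1")

# (\\w+)            first word, captured as group 1
# (?:, (\\w+).*)?   optional and non-capturing: ", ", the second word as
#                   group 2, then any trailing text
# $                 anchors at the end, so the match cannot stop at the
#                   leading digits
m <- str_match(x, "(\\w+)(?:, (\\w+).*)?$")

# Column 1 holds the full match; columns 2 and 3 hold the capture groups.
# str_match() reports a group that did not participate as NA.
t1 <- m[, 2]
t2 <- m[, 3]
```

Here t1 should come out as "WORD" for both rows, and t2 as NA for the first row and "TEST" for the second.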
CodePudding user response:
Maybe something like this:
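library(tidyverse)
df <- data.frame(x = c("123 WORD", "4564 WORD, TEST 1"))
df %>%
  # Drop the digits, then trim the whitespace they leave behind
  mutate(x1 = str_squish(str_remove_all(x, '[0-9]+'))) %>%
  # fill = "right" pads rows that have no second word with NA
  separate(x1, c("t1", "t2"), sep = ', ', fill = "right")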
                  x   t1   t2
1          123 WORD WORD <NA>
2 4564 WORD, TEST 1 WORD TEST
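One thing to watch with the digit-stripping approach: removing the digits leaves the spaces around them in place, and the right-aligned printout hides that. A small sketch of the intermediate strings, assuming stringr's str_squish() is an acceptable way to tidy them (raw and clean are illustrative names):

```r
library(stringr)

x <- c("123 WORD", "4564 WORD, TEST 1")

# Removing the digits alone keeps the spaces that surrounded them:
# " WORD" and " WORD, TEST "
raw <- str_remove_all(x, "[0-9]+")

# str_squish() trims both ends and collapses repeated inner whitespace:
# "WORD" and "WORD, TEST"
clean <- str_squish(raw)
```

Splitting clean on ", " then gives words without stray leading or trailing spaces.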