R - Extract Regex groups

Time:02-11

I am working with a data frame df with a column text which contains strings of either the form "WORD" or "WORD, WORD2". More precisely, there may be some gibberish before or after these blocks, but I know how to take care of them.

I would like to use the tidyverse function extract to pull the two words into two columns t1 and t2 such that

  • the string "WORDS" gets extracted into "WORDS" and NA,
  • the string "WORDS, WORD2" gets extracted into "WORDS" and "WORD2".

I tried a command of the following form

df |> extract(x, c("1", "2"), "([^[:punct:]]+),?[[:space:]]?([^[:punct:]]*)",
              remove = FALSE,
              convert = TRUE)

However, this always reads the first row into "WORD" and "" (empty string). How can I modify my version to obtain the desired behaviour?

EDIT: Here is a possible dataframe

library(tidyverse)

df <- data.frame(x = c("123 WORD", "4564 WORD, TEST 1"))

# Expected output
df_out <- data.frame(x = c("123 WORD", "4564 WORD, TEST 1"),
                     t1 = c("WORD", "WORD"),
                     t2 = c(NA, "TEST"))

CodePudding user response:

df %>%
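  # group 1: first word; optional group 2: the word after ", "
  extract(x, c('t1', 't2'), '(\\w+)(?:, (\\w+).*)?$', remove = FALSE) %>%
  # an unmatched optional group comes back as "", so turn it into NA
  mutate(across(c(t1, t2), ~ na_if(.x, '')))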

                  x   t1   t2
1          123 WORD WORD <NA>
2 4564 WORD, TEST 1 WORD TEST
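
For comparison, here is a minimal base-R sketch of the same idea, in case pulling in tidyr is not an option. It reuses the pattern from the answer above with regexec() and regmatches(). The helper names pattern, m and groups are only illustrative, and the step that maps "" to NA assumes an unmatched optional group comes back as an empty string, as it does with extract().

```r
# Example data from the question
df <- data.frame(x = c("123 WORD", "4564 WORD, TEST 1"))

# Same pattern as above: first word, then an optional ", word" part
pattern <- "(\\w+)(?:, (\\w+).*)?$"

# regexec() reports the full match plus one entry per capture group;
# perl = TRUE is needed for the non-capturing group (?: ... )
m <- regmatches(df$x, regexec(pattern, df$x, perl = TRUE))

# Keep only the two capture groups (position 1 is the full match)
groups <- do.call(rbind, lapply(m, function(v) v[2:3]))

# Treat an empty capture as missing
groups[which(groups == "")] <- NA

df$t1 <- groups[, 1]
df$t2 <- groups[, 2]
# df$t1 is c("WORD", "WORD"); df$t2 is c(NA, "TEST")
```

Rows where the pattern does not match at all end up as NA in both columns, since indexing an empty match vector with 2:3 yields NA.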

CodePudding user response:

Maybe something like this:

df <- data.frame(x = c("123 WORD", "4564 WORD, TEST 1"))

library(tidyverse)

df %>%  
  mutate(x1 = str_remove_all(x, '[0-9]*')) %>% 
  separate(x1, c("t1", "t2"), sep = ', ', remove = FALSE) %>% 
  select(-x1)

                  x    t1    t2
1          123 WORD  WORD  <NA>
2 4564 WORD, TEST 1  WORD TEST 