Using regex groups in dplyr

Time:12-06

I have a string:

txt <- "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22."

I want to extract the author name(s), year, and title from this string. This command, based on a pattern I built with regex101, works:

result <- regmatches(txt, regexec("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt))

result[[1]][2]
[1] "Harris P R, Harris D L"

result[[1]][3]
[1] "1983"

result[[1]][4]
[1] "Training for the Metaindustrial Work Culture"

Assume I have a data frame of strings like txt, for example:

df <- data.frame(txt = c("Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.",
"Cruise M J, Gorenberg B D (1985). The tools of management: keeping high touch in a high tech world. International nursing review, 32(6): 166-169, 173."))

I would like to use regex groups in dplyr as follows:

new_df <- df %>%
    rownames_to_column(var = "row_id") %>%
    mutate(result = regmatches(txt, regexec("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt)),
           authors = result[[row_id]][2],
           year = result[[row_id]][3],
           title = result[[row_id]][4])

This does not work.

Error in `mutate()`:
! Problem while computing `authors = result[[row_id]][2]`.
Caused by error in `result[[row_id]]`:
! no such index at level 1
Run `rlang::last_error()` to see where the error occurred.

rlang::last_error()

<error/dplyr:::mutate_error>
Error in `mutate()`:
! Problem while computing `authors = result[[row_id]][2]`.
Caused by error in `result[[row_id]]`:
! no such index at level 1
---
Backtrace:
 1. df %>% rownames_to_column(var = "row_id") %>% ...
 3. dplyr:::mutate.data.frame(...)
 4. dplyr:::mutate_cols(.data, dplyr_quosures(...), caller_env = caller_env())
 6. mask$eval_all_mutate(quo)
Run `rlang::last_trace()` to see the full context.

What changes do I need to make? Thanks in advance

CodePudding user response:

You can use strcapture in the mutate call with that regex:

df %>%
  mutate(
    strcapture("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt,
               list(authors="", year=0L, title=""))
  ) %>%
  select(-txt)
#                     authors year                                                            title
# 1    Harris P R, Harris D L 1983                     Training for the Metaindustrial Work Culture
# 2 Cruise M J, Gorenberg B D 1985 The tools of management: keeping high touch in a high tech world

(I'm inferring year should be an integer.)
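To see what strcapture returns on its own, here is a minimal base-R sketch outside of mutate. It assumes the pattern with `+` quantifiers and uses a zero-row data.frame as the prototype. The third string is made up, purely to show what happens when the pattern does not match.

```r
# Sample strings from the question, plus one hypothetical non-matching entry
txt <- c(
  "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.",
  "Cruise M J, Gorenberg B D (1985). The tools of management: keeping high touch in a high tech world. International nursing review, 32(6): 166-169, 173.",
  "no citation here"
)

pattern <- "([^\\(]+) \\((\\d+)\\). ([^\\.]+)."

# The prototype fixes the names and types of the output columns
proto <- data.frame(authors = character(), year = integer(), title = character(),
                    stringsAsFactors = FALSE)

# One row per input string; rows without a match are filled with NA
parsed <- utils::strcapture(pattern, txt, proto)
str(parsed)
```

Because the prototype declares year as an integer, the conversion happens inside strcapture and no separate as.integer() step is needed.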

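As for why your code did not work: inside mutate(), row_id is the whole column (a character vector), so result[[row_id]] does not pick one list element per row. You need to iterate over the results and extract the nth element of each: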

df %>%
  mutate(
    result = regmatches(txt, regexec("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt)),
    authors = sapply(result, `[[`, 2),
    year = sapply(result, `[[`, 3),
    title = sapply(result, `[[`, 4)
  ) %>%
  select(-txt, -result)
#                     authors year                                                            title
# 1    Harris P R, Harris D L 1983                     Training for the Metaindustrial Work Culture
# 2 Cruise M J, Gorenberg B D 1985 The tools of management: keeping high touch in a high tech world
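The failure can be reproduced without dplyr. The sketch below assumes that row_id inside mutate() is the full character vector c("1", "2"), which is what rownames_to_column produces for a two-row data frame. It also uses vapply instead of sapply as a type-checked variant; that is a stylistic choice, not something the original code requires.

```r
txt <- c(
  "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.",
  "Cruise M J, Gorenberg B D (1985). The tools of management: keeping high touch in a high tech world. International nursing review, 32(6): 166-169, 173."
)
pattern <- "([^\\(]+) \\((\\d+)\\). ([^\\.]+)."

# A list with one character vector per string: full match, then the three groups
result <- regmatches(txt, regexec(pattern, txt))

# What mutate() sees for row_id: the whole column, not one value per row
row_id <- c("1", "2")

# `[[` with a length-2 vector is treated as recursive indexing, which fails here
msg <- tryCatch(result[[row_id]], error = function(e) conditionMessage(e))
print(msg)

# Extract the nth element from every list entry instead
authors <- vapply(result, function(m) m[2], character(1))
year    <- as.integer(vapply(result, function(m) m[3], character(1)))
title   <- vapply(result, function(m) m[4], character(1))
```

vapply stops with an error if an element does not return exactly one string, whereas sapply would silently return a list in that case.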

or somewhat dynamically (a vector of new-column-names):

df %>%
  mutate(
    result = regmatches(txt, regexec("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt)),
    data.frame(setNames(
      lapply(2:4, function(i) sapply(result, `[[`, i)),
      c("authors", "year", "title")))
  )

CodePudding user response:

Perhaps unnesting could be useful here:

library(tidyr)
library(dplyr)

df %>%
  mutate(result = regmatches(txt, regexec("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt))) %>% 
  unnest_wider(result) %>% 
  select(authors = ...2, year = ...3, title = ...4)

This returns

# A tibble: 2 × 3
  authors                   year  title                                                         
  <chr>                     <chr> <chr>                                                         
1 Harris P R, Harris D L    1983  Training for the Metaindustrial Work Culture                  
2 Cruise M J, Gorenberg B D 1985  The tools of management: keeping high touch in a high tech wo…

CodePudding user response:

We can use separate from the tidyr package:

library(dplyr)
library(tidyr)

df %>% separate(txt, c("authors", "year", "title"), sep = r"{ \(|\)\. }")
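To check what this separator splits on, here is a base-R sketch using strsplit with the same pattern. It assumes separate splits at the same positions as strsplit does. The final sub() call is an extra step that is not part of the answer above; it trims the third piece down to the title.

```r
txt <- "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22."

# Same separator as above, written with escaped backslashes instead of a raw string:
# split on " (" or on "). "
pieces <- strsplit(txt, " \\(|\\)\\. ")[[1]]
print(pieces)

# The third piece still holds the journal, volume and pages after the title,
# so drop everything from the first period onward
title <- sub("\\..*$", "", pieces[3])
print(title)
```

The split yields three pieces, and the last one runs to the end of the string, so a title column built this way includes the journal details unless it is trimmed afterwards.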