I have a string:
txt <- "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22."
I want to extract the author name(s), year, and title from this string. This command, which I worked out with regex101, works:
result <- regmatches(txt, regexec("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt))
result[[1]][2]
[1] "Harris P R, Harris D L"
result[[1]][3]
[1] "1983"
result[[1]][4]
[1] "Training for the Metaindustrial Work Culture"
Assume I have a data frame of strings like txt, for example:
df <- data.frame(txt = c("Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.",
"Cruise M J, Gorenberg B D (1985). The tools of management: keeping high touch in a high tech world. International nursing review, 32(6): 166-169, 173."))
I would like to use regex groups in dplyr as follows:
new_df <- df %>%
  rownames_to_column(var = "row_id") %>%
  mutate(result = regmatches(txt, regexec("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt)),
         authors = result[[row_id]][2],
         year = result[[row_id]][3],
         title = result[[row_id]][4])
This does not work.
Error in `mutate()`:
! Problem while computing `authors = result[[row_id]][2]`.
Caused by error in `result[[row_id]]`:
! no such index at level 1
Run `rlang::last_error()` to see where the error occurred.
rlang::last_error()
<error/dplyr:::mutate_error>
Error in `mutate()`:
! Problem while computing `authors = result[[row_id]][2]`.
Caused by error in `result[[row_id]]`:
! no such index at level 1
---
Backtrace:
1. df %>% rownames_to_column(var = "row_id") %>% ...
3. dplyr:::mutate.data.frame(...)
4. dplyr:::mutate_cols(.data, dplyr_quosures(...), caller_env = caller_env())
6. mask$eval_all_mutate(quo)
Run `rlang::last_trace()` to see the full context.
What changes do I need to make? Thanks in advance
CodePudding user response:
You can use strcapture in the mutate call with that regex:
df %>%
  mutate(
    strcapture("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt,
               list(authors="", year=0L, title=""))
  ) %>%
  select(-txt)
# authors year title
# 1 Harris P R, Harris D L 1983 Training for the Metaindustrial Work Culture
# 2 Cruise M J, Gorenberg B D 1985 The tools of management: keeping high touch in a high tech world
(I'm inferring year should be an integer.)
But as to why your code did not work: inside mutate, result is the whole list column and row_id is the whole (character) vector of row names, so result[[row_id]] does not pick out one row's match. You need to iterate over the results and extract the nth element of each:
df %>%
  mutate(
    result = regmatches(txt, regexec("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt)),
    authors = sapply(result, `[[`, 2),
    year = sapply(result, `[[`, 3),
    title = sapply(result, `[[`, 4)
  ) %>%
  select(-txt, -result)
# authors year title
# 1 Harris P R, Harris D L 1983 Training for the Metaindustrial Work Culture
# 2 Cruise M J, Gorenberg B D 1985 The tools of management: keeping high touch in a high tech world
or somewhat dynamically (a vector of new-column-names):
df %>%
  mutate(
    result = regmatches(txt, regexec("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt)),
    data.frame(setNames(
      # for each group index, pull that element out of every row's match
      lapply(2:4, function(i) sapply(result, `[[`, i)),
      c("authors", "year", "title")))
  )
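To make the shape of the regmatches() result concrete, here is a minimal base R sketch, without dplyr, of the same per-row extraction. The helper name get_group is hypothetical (it is not part of any package), and the sketch assumes every string follows the "authors (year). title." layout.

```r
txt <- c(
  "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.",
  "Cruise M J, Gorenberg B D (1985). The tools of management: keeping high touch in a high tech world. International nursing review, 32(6): 166-169, 173."
)
pattern <- "([^\\(]+) \\((\\d+)\\). ([^\\.]+)."

# One list element per input string; each element is a character vector
# holding the full match followed by the three capture groups.
result <- regmatches(txt, regexec(pattern, txt))

# Hypothetical helper: take element i from every row's match vector.
# `[` (rather than `[[`) yields NA instead of an error for a non-matching row.
get_group <- function(matches, i) vapply(matches, `[`, character(1), i)

parsed <- data.frame(
  authors = get_group(result, 2),
  year    = as.integer(get_group(result, 3)),
  title   = get_group(result, 4)
)
```

Using vapply with a declared character(1) return type is a design choice here: it fails loudly if a row ever produces something other than a single string, where sapply would silently change the result's shape.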
CodePudding user response:
Perhaps unnesting could be useful here:
library(tidyr)
library(dplyr)
df %>%
  mutate(result = regmatches(txt, regexec("([^\\(]+) \\((\\d+)\\). ([^\\.]+).", txt))) %>%
  unnest_wider(result) %>%
  select(authors = ...2, year = ...3, title = ...4)
This returns
# A tibble: 2 × 3
authors year title
<chr> <chr> <chr>
1 Harris P R, Harris D L 1983 Training for the Metaindustrial Work Culture
2 Cruise M J, Gorenberg B D 1985 The tools of management: keeping high touch in a high tech wo…
CodePudding user response:
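We can use separate from the tidyr package:
library(dplyr)
library(tidyr)
# splits on " (" and "). "; the third column keeps everything after the year,
# so the journal details stay attached to the title
df %>% separate(txt, c("authors", "year", "title"), sep = r"{ \(|\)\. }")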
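As a rough illustration of what that separator does, the base R sketch below applies the same split with strsplit() and then trims the last piece down to the title. The variable names pieces and title are only illustrative, and the trimming step assumes the title itself contains no period, the same assumption the regex in the question makes.

```r
txt <- "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22."

# Same separator as in the separate() call: either " (" or "). "
pieces <- strsplit(txt, " \\(|\\)\\. ")[[1]]

# pieces[3] still contains the title followed by the journal details,
# so keep only the text before the first period.
title <- sub("\\..*$", "", pieces[3])
```

The "(7)" in the volume and issue is not split because the separator requires a space before the opening parenthesis.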