I'm looking for a two-digit number that comes before the word "years" and a seven- or eight-digit number that comes after the word "years." An example of the data is shown below.
data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"
data <- as.list(data)
I tried this approach and succeeded in getting the two-digit numbers before the word "years":
stringr::str_extract_all(data,regex(".\\d{2}\\s(?:year)"))
I also tried this method to get the number after the word "years":
str_extract_all(data,regex(".\\d{2}\\s(?:year).\\d{7,8}"))
I managed to get the number that appears directly after the word "years":
" 57 year 7654321"
However, I was unsuccessful in getting the eight-digit number that follows the word "years" when other characters come between the word "years" and the number.
How can I retrieve the number only after the word "years" by skipping this other word/character?
I really appreciate your help
CodePudding user response:
We may use str_replace_all to match and remove the non-digits before and after 'years', and then extract the digits on either side together with 'years':
library(stringr)
str_extract_all(str_replace_all(data,
  "(?<=years)\\D+|(\\D+)(?=years)", " "), "\\d{2}\\s+years\\s+\\d{7,8}")[[1]]
[1] "45 years 12345678" "57 years 7654321"
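To make the two steps of this approach easier to follow, here is a minimal sketch that stores the intermediate string before extracting. The variable names `cleaned` and `extracted` are only illustrative, and the patterns assume the quantifiers `\\D+` and `\\s+` as written in the answer above.

```r
library(stringr)

data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"

# Step 1: replace every run of non-digits that directly precedes or follows
# 'years' with a single space.
cleaned <- str_replace_all(data, "(?<=years)\\D+|(\\D+)(?=years)", " ")
cleaned
# "mr john is 45 years 12345678, mr doe is 57 years 7654321"

# Step 2: the age, 'years', and the number are now separated only by spaces,
# so one pattern can extract all three together.
extracted <- str_extract_all(cleaned, "\\d{2}\\s+years\\s+\\d{7,8}")[[1]]
extracted
```

Inspecting `cleaned` shows why the second pattern no longer has to skip words such as "old his number is".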
Another option is to capture the digits, along with the 'years' substring, with str_match_all and then paste them together:
library(purrr)
library(dplyr)
str_match_all(data, "(\\d{2})\\D+(years)\\D+(\\d{7,8})")[[1]][,-1] %>%
  as.data.frame %>%
  invoke(str_c, sep =" ", .)
[1] "45 years 12345678" "57 years 7654321"
data
data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"
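If only the number after "years" is needed, as the question asks, the capture groups can be read directly instead of being pasted back together. The following is a sketch under that assumption, not part of the original answer; the names `m`, `ages`, and `numbers` are illustrative.

```r
library(stringr)

data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"

# Group 1: the two-digit age. Group 2: the 7- or 8-digit number.
# \\D*? lazily skips any non-digit text between 'years' and the number.
m <- str_match_all(data, "(\\d{2})\\s+years\\D*?(\\d{7,8})")[[1]]

# Column 1 of the match matrix holds the full match, so the groups
# start at column 2.
ages    <- m[, 2]
numbers <- m[, 3]
numbers
```

With the sample string, `numbers` should hold "12345678" and "7654321", with the intervening words skipped.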
CodePudding user response:
Here is a base R approach:
- Create a list with strsplit, separating by ",".
- Define a function my_func that takes a string, searches for the number before "year" and the number after "year", and then pastes them all together.
- Use lapply to apply your function to the list.
- Use toString() to get the expected output.
my_list <- strsplit(data, ",")
my_func <- function(x){
  a <- as.integer(sub(".*?(\\d+)\\s*year.*", "\\1", x))  # number before "year"
  b <- as.integer(sub(".*?year.*?(\\d+).*", "\\1", x))   # first number after "year"
  paste(a, "year", b)
paste(a, "year", b)
}
result <- lapply(my_list, my_func)
lapply(result, toString)
Output:
[[1]]
[1] "45 year 12345678, 57 year 7654321"
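A variation that stays in base R but does not rely on the records being separated by commas might use gregexpr with regmatches. This is only a sketch; the names `hits`, `age`, and `number` are illustrative, and it assumes each record has a two-digit age followed by "years" and then a 7- or 8-digit number.

```r
data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"

# Find every stretch that runs from the two-digit age through the number,
# lazily skipping any non-digit text after 'years'.
hits <- regmatches(
  data,
  gregexpr("\\d{2}\\s+years\\D*?\\d{7,8}", data, perl = TRUE)
)[[1]]

# Keep what comes before 'years' (the age) and the trailing run of digits
# (the number).
age    <- sub("\\s+years.*", "", hits)
number <- sub(".*\\D", "", hits)

paste(age, "years", number)
```

The greedy `.*\\D` in the last `sub` removes everything up to and including the final non-digit character, leaving only the digits at the end of each hit.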