Good morning,
I have a dataframe where one of the columns has observations that look like this:
row1: 28316496(15)|28943784(8)|28579919(7)
row2: 29343898(1)
I would like to create a new column that extracts the numbers that are not in parentheses, and then combine the numbers from all rows into a single list.
Said differently, at the end I would like to end up with the following list:
28316496;28943784;28579919;29343898
It could also be any other similar object, I am just interested in getting all these numbers and matching them with another dataset.
I have tried using str_extract_all to extract the numbers but I am having trouble understanding the pattern argument. For instance I have tried:
str_extract_all("28316496(15)|28943784(8)", "\d (\d)")
and
gsub("\s*\(.*", "", "28316496(15)|28943784(8)")
but it is not returning exactly what I want.
Any idea how to extract the numbers outside the brackets and create one big list out of them?
Thanks a lot!
CodePudding user response:
In base R, we can use gsub to remove each "(" followed by digits and ")", and then use read.table to read the result into a data.frame:
read.table(text = gsub("\\(\\d+\\)", "", df1$col1),
           header = FALSE, sep = "|", fill = TRUE)
        V1       V2       V3
1 28316496 28943784 28579919
2 29343898       NA       NA
Or using str_extract_all with a regex lookahead, which matches digits only when they are immediately followed by "(":
library(stringr)
str_extract_all(df1$col1, "\\d+(?=\\()")
[[1]]
[1] "28316496" "28943784" "28579919"
[[2]]
[1] "29343898"
data
df1 <- structure(list(col1 = c("28316496(15)|28943784(8)|28579919(7)",
"29343898(1)")), class = "data.frame", row.names = c(NA, -2L))
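To get the single combined list asked for in the question, the per-row results can be flattened with unlist. This is a minimal sketch, assuming the same df1 as above (re-created here so the block runs on its own); all_ids is just an illustrative name.

```r
library(stringr)

# same example data as df1 above
df1 <- data.frame(col1 = c("28316496(15)|28943784(8)|28579919(7)",
                           "29343898(1)"))

# extract the digits that precede "(" in every row, then flatten
# the resulting list into one character vector
all_ids <- unlist(str_extract_all(df1$col1, "\\d+(?=\\()"))
all_ids

# collapse into one ";"-separated string if that form is preferred
paste(all_ids, collapse = ";")
```

The flattened vector is usually the more convenient form for matching; the collapsed string is only needed if the output has to look exactly like the list in the question.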
CodePudding user response:
Here is a way.
x <- c("28316496(15)|28943784(8)|28579919(7)", "29343898(1)")
y <- strsplit(x, "\\|")
y <- lapply(y, \(.y) sub("\\([^\\(\\)]+\\)$", "", .y))
y
#> [[1]]
#> [1] "28316496" "28943784" "28579919"
#>
#> [[2]]
#> [1] "29343898"
Created on 2022-09-24 with reprex v2.0.2
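Since the end goal is to match these numbers with another dataset, the list can be flattened and used with %in%. This is only a sketch: the second dataset other and its columns id and label are made up for illustration, and it assumes the ids are stored as character there.

```r
x <- c("28316496(15)|28943784(8)|28579919(7)", "29343898(1)")

# split on "|", drop the trailing "(n)", and flatten into one vector
ids <- unlist(lapply(strsplit(x, "\\|"), \(.y) sub("\\([^()]+\\)$", "", .y)))

# hypothetical second dataset; 'id' and 'label' are made-up column names
other <- data.frame(id = c("28943784", "29343898", "11111111"),
                    label = c("a", "b", "c"))

# keep only the rows of 'other' whose id appears in the extracted list
matched <- other[other$id %in% ids, ]
matched
```

If the other dataset stores the ids as numbers instead, converting one side first (for example with as.numeric(ids)) keeps the comparison consistent.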