R - Extract numbers before characters, create a list

Time:09-27

Good morning,

I have a data frame where one of the columns has observations that look like this:

row1: 28316496(15)|28943784(8)|28579919(7)

row2: 29343898(1)

I would like to create a new column that extracts the numbers that are not in parentheses, and then combine the numbers from every row into a single list.

In other words, I would like to end up with the following list:

28316496;28943784;28579919;29343898

It could also be any other similar object, I am just interested in getting all these numbers and matching them with another dataset.

I have tried using str_extract_all to extract the numbers but I am having trouble understanding the pattern argument. For instance I have tried:

str_extract_all("28316496(15)|28943784(8)", "\\d+(\\d)")

and

gsub("\\s*\\(.*", "", "28316496(15)|28943784(8)")

but it is not returning exactly what I want.

Any ideas for extracting the numbers outside the brackets and creating one big list from them?

Thanks a lot!

CodePudding user response:

In base R, we can use gsub to remove each parenthesized group (an opening parenthesis, one or more digits, and a closing parenthesis), then use read.table to read the result into a data.frame:

read.table(text = gsub("\\(\\d+\\)", "", df1$col1), 
    header = FALSE, sep = "|", fill = TRUE)
        V1       V2       V3
1 28316496 28943784 28579919
2 29343898       NA       NA

Or, with str_extract_all from stringr, use a regex lookahead to match each run of digits that comes right before an opening parenthesis:

library(stringr)
str_extract_all(df1$col1, "\\d+(?=\\()")
[[1]]
[1] "28316496" "28943784" "28579919"

[[2]]
[1] "29343898"

data

df1 <- structure(list(col1 = c("28316496(15)|28943784(8)|28579919(7)", 
"29343898(1)")), class = "data.frame", row.names = c(NA, -2L))
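Neither result above is yet the single combined list the question asks for. A minimal base R sketch of that last step follows, assuming every entry has the form digits followed by a parenthesized count; `other` and its `id` column are hypothetical stand-ins for the second dataset mentioned in the question.

```r
df1 <- data.frame(
  col1 = c("28316496(15)|28943784(8)|28579919(7)", "29343898(1)"),
  stringsAsFactors = FALSE
)

# Drop each "(digits)" group, then split every row on the literal "|"
ids_by_row <- strsplit(gsub("\\(\\d+\\)", "", df1$col1), "|", fixed = TRUE)

# Flatten the per-row results into one character vector
all_ids <- unlist(ids_by_row)

# Optional: the semicolon-separated form shown in the question
ids_string <- paste(all_ids, collapse = ";")
ids_string
#> [1] "28316496;28943784;28579919;29343898"

# Hypothetical second dataset: keep only the rows whose id appears in all_ids
other <- data.frame(id = c("28943784", "29343898", "11111111"),
                    stringsAsFactors = FALSE)
matched <- other[other$id %in% all_ids, , drop = FALSE]
```

The IDs are kept as character here; if the other dataset stores them as numbers, wrapping `all_ids` in as.numeric() before matching would probably be needed.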

CodePudding user response:

Here is a way.

x <- c("28316496(15)|28943784(8)|28579919(7)", "29343898(1)")

y <- strsplit(x, "\\|")
y <- lapply(y, \(.y) sub("\\([^\\(\\)]+\\)$", "", .y))
y
#> [[1]]
#> [1] "28316496" "28943784" "28579919"
#> 
#> [[2]]
#> [1] "29343898"

Created on 2022-09-24 with reprex v2.0.2
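Since the question also asks for a new column, the per-row results could be stored as a list column. This is only a sketch in base R, using regmatches() with gregexpr() instead of stringr; the column name `ids` is made up for illustration.

```r
df1 <- data.frame(
  col1 = c("28316496(15)|28943784(8)|28579919(7)", "29343898(1)"),
  stringsAsFactors = FALSE
)

# perl = TRUE is required because the lookahead (?=\\() is PCRE syntax
m <- gregexpr("\\d+(?=\\()", df1$col1, perl = TRUE)

# regmatches() returns one character vector per row; a list with one
# element per row can be assigned directly as a list column
df1$ids <- regmatches(df1$col1, m)

# Number of IDs found in each row
lengths(df1$ids)
#> [1] 3 1
```

Calling unlist(df1$ids) on that column then gives all the numbers as one vector.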
