Is there an R-function to extract specific string variables

Time:09-22

I have a data frame in R in which one of the columns is a long HTML string that contains a lot of tagged values. I want to extract specific values into new columns:

This is what I have:

name <- c("John", "Max")
bio <- c("<status>1</status><profession>Revisor</professio>", "<status>1</status><born>19.06.1995</born><profession>Tech</professio>" )

df <- data.frame(name, bio)

This is what I want:

name <- c("John", "Max")
status <- c(1,1)
profession <- c("Revisor", "Tech")

df <- data.frame(name, status, profession)

CodePudding user response:

This can be done easily with extract:

library(tidyr)
df %>%
  extract(bio,
          into = c("status", "profession"),
          regex = "<status>(\\d+)</status>.*<profession>(\\w+)</professio>")
  name status profession
1 John      1    Revisor
2  Max      1       Tech

The regex part describes the strings from which elements should be extracted in full, while capturing the elements of interest in capture groups defined by ().
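As a minimal sketch of the same capture-group idea outside of extract, stringr's str_match returns the whole match in its first column and one further column per capture group. The data is rebuilt from the question; the variable name m is only illustrative, and converting status to integer is an assumption about the desired column type.

```r
library(stringr)

# Data from the question (the closing tag is spelled "</professio>" there)
name <- c("John", "Max")
bio <- c("<status>1</status><profession>Revisor</professio>",
         "<status>1</status><born>19.06.1995</born><profession>Tech</professio>")
df <- data.frame(name, bio)

# Column 1 of m is the full match; columns 2 and 3 hold the two capture groups
m <- str_match(df$bio, "<status>(\\d+)</status>.*<profession>(\\w+)</professio>")

df$status <- as.integer(m[, 2])  # assumed: status should be numeric
df$profession <- m[, 3]
df$bio <- NULL                   # drop the raw string once the values are extracted
```

A string that does not match the whole pattern yields a row of NA in m, so rows with a missing tag are easy to spot.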

Alternatively, you can use str_extract:

library(stringr)
df$status <- str_extract(df$bio, pattern = "(?<=<status>)\\d+(?=</status>)")
df$profession <- str_extract(df$bio, pattern = "(?<=<profession>)\\w+(?=</professio>)")

Here we are making use of lookarounds to conditionally match, for example:

  • (?<=<status>): positive lookbehind to assert that the match must be preceded by the literal string <status>
  • \\d+: one or more digits
  • (?=</status>): positive lookahead to assert that the match must be followed by the literal string </status>
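The same lookaround pattern can be sketched for the optional born tag, which appears in only one of the two strings in the question. This is an illustration under the assumption that the date consists only of digits and dots; the variable name born is not part of the original answer.

```r
library(stringr)

# The bio strings from the question
bio <- c("<status>1</status><profession>Revisor</professio>",
         "<status>1</status><born>19.06.1995</born><profession>Tech</professio>")

# The lookbehind and lookahead assert the surrounding tags without
# including them in the match, so only the date text is returned.
born <- str_extract(bio, "(?<=<born>)[\\d.]+(?=</born>)")

# The first string has no <born> tag, so its entry is NA
born
```

Because each field is matched independently here, a missing tag only affects its own column, whereas a single pattern covering several tags fails as a whole when one of them is absent.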

CodePudding user response:

It is possible using regular expressions with the R package stringr. Here is an example:

library(stringr)
name <- c("John", "Max")
bio <- c("<status>1</status><profession>Revisor</professio>", "<status>1</status><born>19.06.1995</born><profession>Tech</professio>" )

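# str_extract returns a character vector (first match per string);
# str_extract_all would return a list, which is awkward to put in a data frame
status <- stringr::str_extract(bio, pattern = "<status>\\d+</status>")
# keep only the second capture group, i.e. the value between the tags
status <- stringr::str_replace_all(status, pattern = "(<status>)(\\d+)(</status>)", "\\2")

profession <- stringr::str_extract(bio, pattern = "<profession>[[:alpha:]]*</professio>")
profession <- stringr::str_replace_all(profession, pattern = "(<profession>)([[:alpha:]]*)(</professio>)", "\\2")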

df <- data.frame(name, status, profession)