I have a data frame in R where one of the columns is a long HTML string that contains a lot of fields. I want to extract specific values into new columns:
This is what I have:
name <- c("John", "Max")
bio <- c("<status>1</status><profession>Revisor</professio>", "<status>1</status><born>19.06.1995</born><profession>Tech</professio>" )
df <- data.frame(name, bio)
This is what I want:
name <- c("John", "Max")
status <- c(1,1)
profession <- c("Revisor", "Tech")
df <- data.frame(name, status, profession)
CodePudding user response:
This can be done easily with extract:
library(tidyr)
df %>%
  extract(bio,
          into = c("status", "profession"),
          regex = "<status>(\\d+)</status>.*<profession>(\\w+)</professio>")
  name status profession
1 John      1    Revisor
2  Max      1       Tech
The regex part describes, in full, the strings from which elements should be extracted, while the elements of interest are captured in capture groups defined by ().
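As a small sketch of how the capture groups map onto the new columns, the call below repeats the extraction on the question's sample data and additionally sets convert = TRUE, so that status comes back as a number, as in the desired output. The result name res is just an illustrative choice, and this assumes tidyr is installed.

```r
library(tidyr)

name <- c("John", "Max")
bio <- c("<status>1</status><profession>Revisor</professio>",
         "<status>1</status><born>19.06.1995</born><profession>Tech</professio>")
df <- data.frame(name, bio)

# Each pair of parentheses is one capture group; the groups are assigned
# to the names in `into` in order of appearance.
# convert = TRUE applies type conversion to the new columns,
# so the digits captured for status become an integer column.
res <- tidyr::extract(df, bio,
                      into = c("status", "profession"),
                      regex = "<status>(\\d+)</status>.*<profession>(\\w+)</professio>",
                      convert = TRUE)

str(res)
```

Without convert = TRUE both new columns are character, which may or may not matter depending on what you do with status afterwards.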
Alternatively, you can use str_extract:
library(stringr)
df$status <- str_extract(df$bio, pattern = "(?<=<status>)\\d+(?=</status>)")
df$profession <- str_extract(df$bio, pattern = "(?<=<profession>)\\w+(?=</professio>)")
Here we are making use of lookarounds to match conditionally, for example:
(?<=<status>): positive lookbehind to assert that the match must be preceded by the literal string <status>
\\d+: one or more digits
(?=</status>): positive lookahead to assert that the match must be followed by the literal string </status>
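The same lookaround patterns can also be tried without any packages. The following is a minimal base R sketch, assuming the sample bio vector from the question; the variable names status_pat and prof_pat are only illustrative. Base R needs perl = TRUE for lookarounds to be recognised.

```r
bio <- c("<status>1</status><profession>Revisor</professio>",
         "<status>1</status><born>19.06.1995</born><profession>Tech</professio>")

# Lookbehind and lookahead assert the surrounding tags without consuming them,
# so only the text between the tags ends up in the match.
status_pat <- "(?<=<status>)\\d+(?=</status>)"
prof_pat   <- "(?<=<profession>)\\w+(?=</professio>)"

# regexpr() locates the first match per string; regmatches() pulls it out.
status     <- regmatches(bio, regexpr(status_pat, bio, perl = TRUE))
profession <- regmatches(bio, regexpr(prof_pat, bio, perl = TRUE))

print(status)
print(profession)
```

One caveat with this variant: regmatches() drops strings that have no match, so the result can be shorter than the input if some rows lack a tag. str_extract returns NA in that case, which is usually easier to assign back into a data frame column.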
CodePudding user response:
It is possible using regular expressions with the R package stringr. Here is an example:
library(stringr)
name <- c("John", "Max")
bio <- c("<status>1</status><profession>Revisor</professio>", "<status>1</status><born>19.06.1995</born><profession>Tech</professio>" )
# str_extract returns a character vector (str_extract_all would return a list)
status <- stringr::str_extract(bio, pattern = "<status>\\d+</status>")
# keep only the second capture group, i.e. the text between the tags
status <- stringr::str_replace_all(status, pattern = "(<status>)(\\d+)(</status>)", "\\2")
profession <- stringr::str_extract(bio, pattern = "<profession>[:alpha:]*</professio>")
profession <- stringr::str_replace_all(profession, pattern = "(<profession>)([:alpha:]*)(</professio>)", "\\2")
df <- data.frame(name, status, profession)
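The extract-then-replace pair above can probably be collapsed into a single step with str_match, which returns the whole match in column 1 and each capture group in the following columns. This is only a sketch on the question's sample data, assuming stringr is installed; as.numeric is added so that status matches the numeric column asked for.

```r
library(stringr)

name <- c("John", "Max")
bio <- c("<status>1</status><profession>Revisor</professio>",
         "<status>1</status><born>19.06.1995</born><profession>Tech</professio>")

# Column 2 of the str_match result holds the first capture group.
status     <- as.numeric(str_match(bio, "<status>(\\d+)</status>")[, 2])
profession <- str_match(bio, "<profession>([[:alpha:]]*)</professio>")[, 2]

df <- data.frame(name, status, profession)
print(df)
```

Rows without a matching tag yield NA here rather than an error, so the columns keep the same length as name.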