How to extract a specific field from a text dataset?

Time:01-04

My data:

dfx=structure(list(V1 = c("(Description and Operation, 100-00 General Information) <a data-searchnum=G2107576  data-procuid=G1620638>Acceleration Control - Overview", 
                      "(Description and Operation, 310-02 Acceleration Control) <a data-searchnum=G2232632  data-procuid=G2210282>Acceleration Control - System Operation and Component Description", 
                      "(Description and Operation, 310-02 Acceleration Control) <a data-searchnum=G2232633  data-procuid=G2210283>Acceleration Control", 
                      "(Diagnosis and Testing, 310-02 Acceleration Control) <a data-searchnum=G2118147  data-procuid=G2118148>Accelerator Pedal ")), class = "data.frame", row.names = c(NA, 
                                                                                                                      -4L))

I need to extract the data-searchnum value from each row and store it in a new data frame:

G2107576
G2232632
G2232633
G2118147
G2110035

CodePudding user response:

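Use str_extract with a capture group, (...), for the non-whitespace characters that follow the data-searchnum= substring. Setting group = 1 returns only the captured part (the group argument requires stringr 1.5.0 or later).

library(stringr)
str_extract(dfx$V1, 'data-searchnum=(\\S+)', group = 1)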
[1] "G2107576" "G2232632" "G2232633" "G2118147"
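If the installed stringr is older and does not accept the group argument, str_match should give the same values. The sketch below is an illustration, not part of the original answer; the vector x is a shortened, made-up stand-in for dfx$V1 with the same attribute layout.

```r
library(stringr)

# Shortened stand-in for dfx$V1 (assumption: same attribute layout as the real data)
x <- c(
  "(Description and Operation) <a data-searchnum=G2107576  data-procuid=G1620638>Overview",
  "(Diagnosis and Testing) <a data-searchnum=G2118147  data-procuid=G2118148>Accelerator Pedal "
)

# str_match() returns a character matrix:
# column 1 holds the full match, column 2 the first capture group
m <- str_match(x, "data-searchnum=(\\S+)")
searchnum <- m[, 2]
searchnum
```

Rows without a data-searchnum attribute come back as NA in that column.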

Or use str_replace: match the whole string, capture the non-whitespace characters after data-searchnum=, and replace the match with the backreference (\\1)

str_replace(dfx$V1, ".*data-searchnum=(\\S+)\\s+.*", "\\1")
[1] "G2107576" "G2232632" "G2232633" "G2118147"

To store the result in a new data frame:

library(dplyr)
df2 <- dfx %>%
       mutate(V1 = str_extract(V1, 'data-searchnum=(\\S+)', group = 1))
> df2
        V1
1 G2107576
2 G2232632
3 G2232633
4 G2118147

Or, in base R, use sub with the same pattern as in the str_replace call

sub(".*data-searchnum=(\\S+)\\s+.*", "\\1", dfx$V1)
[1] "G2107576" "G2232632" "G2232633" "G2118147"
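
One caveat with the sub approach: when a row does not contain data-searchnum= at all, sub returns that row unchanged instead of a missing value. A minimal base R sketch that returns NA for such rows is shown below; extract_searchnum is a hypothetical helper name and x is made-up sample input, neither comes from the original answer.

```r
# Hypothetical helper (not from the original answer): pull the value after
# data-searchnum= and return NA where the attribute is absent.
extract_searchnum <- function(x) {
  # regexec() + regmatches() give, per element, c(full_match, group_1),
  # or character(0) when the pattern does not match
  hits <- regmatches(x, regexec("data-searchnum=(\\S+)", x))
  vapply(
    hits,
    function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
    character(1)
  )
}

# Made-up sample input: one row with the attribute, one without
x <- c(
  "<a data-searchnum=G2107576  data-procuid=G1620638>Overview",
  "no attribute here"
)
extract_searchnum(x)
```

With dfx loaded, data.frame(searchnum = extract_searchnum(dfx$V1)) would build the new data frame without any extra packages.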