How to use R to retrieve certain characters from multiple places in a string

Time:07-12

In my dataset, I have a column that contains strings like this:

id<-c(1:4)
colstr<-c("<div ><p>107. <span style=\"font-weight: normal;\">Did the </span>Goodie bag<span style=\"font-weight: normal;\"> encourage you to go back for your month one PrEP refill?</span></p></div>","<div ><p>110. Have you ever seen the <span style=\"color: #3598db;\">brochure</span> that is contained in the 'Goodie Bag'?</p></div>","<div ><p>116. <span style=\"font-weight: normal;\">Have you ever used the </span>call-in line<span style=\"font-weight: normal;\"> phone number on the brochure</span>?</p></div>","<div class='box-body'><b><p style=\"text-transform:uppercase; border:1px solid black;padding:2px;color:blue\"><span style=\"display:block;border:1px solid grey;padding:10px\">Review the data entered and make sure there is <i style=\"color:red\">*no missing data*</i>.<br/>Thereafter, click on <i style=\"color:red\">save & exit record</i> to save this interview</span></p></b></div>")

df<-data.frame(id, colstr)

For the column "colstr", I only want to keep the words outside of "<xxxx>". For example, the ideal result looks like this:

id    colstr
1    107. Did the Goodie bag encourage you to go back for your month one PrEP refill?
2    110. Have you ever seen the brochure that is contained in the 'Goodie Bag'? 
....

As in the example, I need to retrieve a whole sentence from different places in a string that is broken up by irregular "<xxxx>" tags. How should I write the R code, and what pattern should I use, to retrieve only the words I want? Thanks a lot!

CodePudding user response:

One approach, assuming the HTML tags are not nested, would be to simply strip off all opening and closing tags:

df$colstr <- gsub("</?.*?>", "", df$colstr)
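To see what that pattern does, here is a minimal base R sketch on two shortened strings in the same shape as the question's data. The shortened sample, the new `clean` column, and the `trimws()` step are assumptions added for illustration; they are not part of the original answer.

```r
# Shortened sample rows (hypothetical, modeled on the question's data).
df <- data.frame(
  id = 1:2,
  colstr = c(
    "<div ><p>107. <span style=\"font-weight: normal;\">Did the </span>Goodie bag<span style=\"font-weight: normal;\"> encourage you?</span></p></div>",
    "<div ><p>110. Have you ever seen the <span style=\"color: #3598db;\">brochure</span>?</p></div>"
  ),
  stringsAsFactors = FALSE
)

# Match "<", an optional "/", then as few characters as possible up to ">",
# and replace every such match with an empty string.
df$clean <- gsub("</?.*?>", "", df$colstr)

# Drop any leading or trailing whitespace left behind by the removed tags.
df$clean <- trimws(df$clean)

print(df$clean)
```

Note that a regex like this only removes the tags. HTML entities such as `&amp;` in the text would be left as they are.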

CodePudding user response:

Your text really looks like HTML. Have you looked into the rvest package? You could read your HTML and keep all the information, then extract the text from it when needed. This would be a much cleaner and easier way to do what you want.

An example would be:

library(rvest)

colstr <- read_html("https://www.youwebsite.html") %>% 
  html_text2()
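Since the question's HTML already sits in a character column rather than on a website, the same idea can be applied string by string. The following is a rough sketch, assuming rvest 1.0.0 or later is installed (for `html_text2()`) and relying on `read_html()` accepting a literal HTML string. The helper name `strip_html` and the shortened sample strings are hypothetical.

```r
library(rvest)

# Shortened sample strings (hypothetical, modeled on the question's data).
colstr <- c(
  "<div ><p>110. Have you ever seen the <span style=\"color: #3598db;\">brochure</span>?</p></div>",
  "<div><p>Click on <i style=\"color:red\">save &amp; exit record</i></p></div>"
)

# Hypothetical helper: parse one HTML string and return its visible text.
strip_html <- function(x) html_text2(read_html(x))

# read_html() parses a single string at a time, so apply the helper
# to each element of the vector.
clean <- vapply(colstr, strip_html, character(1), USE.NAMES = FALSE)

print(clean)
```

Unlike the regex approach, a real HTML parser also decodes entities such as `&amp;`, which may matter if the cleaned text is shown to users.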