R and XML - check if there is italic inside text

Time:01-22

I have an XML file with text. Sometimes the text contains an <italic> element. I would like to fill my dataframe with the text in italic, if it appears.

<question>
    <text>Just text</text>
    <text>Text with reaction<italic>Reaction</italic></text>
</question>

And I would like a dataframe like this:

text                 italic
Just text            NA
Text with reaction   Reaction   

Here is my code so far:

library(tidyverse)
library(XML)

question = xmlParse("
<question>
    <text>Just text</text>
    <text>Text with reaction<italic>Reaction</italic></text>
</question>")
xml_data_question <- xmlToList(question)

#Create dataframe (df) #
df_question <- data.frame(matrix(ncol = 2, nrow = 2))
x = c("text", "remarks")
colnames(df_question) <- x


#Fill dataframe
df_question$text[1] = xml_data_question[1]$text
# df_question$remarks[1] = ifelse(xml_data_question[1]$text$italic != NULL, xml_data_question[1]$text$italic, NA)
df_question$text[2] = xml_data_question[2]$text
df_question$remarks[2] = ifelse(xml_data_question[2]$text$italic != NULL, xml_data_question[2]$text$italic, NA)

Of course, df_question$remarks[1] = xml_data_question[1]$text$italic does not work, because the first text element has no italic child. How can I detect that it is missing (an error, NA, or something similar) so that I can record it in my dataframe?
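As a side note on the commented-out ifelse() line: comparing against NULL with != yields a zero-length logical, so ifelse() has nothing to return. A minimal base R sketch of the usual is.null() check follows. The lists with_italic and without_italic and the helper get_or_na() are hypothetical stand-ins for illustration, not the exact structure xmlToList() returns.

```r
# Hypothetical stand-ins for two parsed <text> elements
with_italic    <- list(text = "Text with reaction", italic = "Reaction")
without_italic <- list(text = "Just text")

# Comparing with NULL gives a zero-length result, not TRUE or FALSE
length(without_italic$italic != NULL)

# Hypothetical helper: return the named element, or NA if it is absent.
# The is.list() guard covers the case where the element is a bare string.
get_or_na <- function(x, name) {
  if (is.list(x) && !is.null(x[[name]])) x[[name]] else NA_character_
}

get_or_na(with_italic, "italic")
get_or_na(without_italic, "italic")
get_or_na("Just text", "italic")
```

The is.list() guard is there on the assumption that an element containing only text may come back as a plain character string rather than a list.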

CodePudding user response:

I prefer the xml2 package to the XML package because I find its syntax easier to use.

Here is a non-compact answer, but it gets the job done. The comments in the code explain the process step by step.

library(xml2)
library(dplyr)
library(stringr)

page <- read_xml("<question>
    <text>Just text</text>
    <text>Text with reaction<italic>Reaction</italic></text>
</question>")

#find the lines of text
textchunks <- xml_find_all(page, ".//text")

#get the plain text
text <- textchunks %>% xml_text()

#get the plain text of the italic portions
italic <- textchunks %>% xml_find_first( ".//italic") %>% xml_text()

answer <- data.frame(text, italic)
#replace the italic in the original text with nothing
answer$text <- if_else(is.na(answer$italic), answer$text, stringr::str_replace(answer$text, answer$italic, ""))

answer
#                 text   italic
# 1          Just text     <NA>
# 2 Text with reaction Reaction

This assumes there is at most one italic block per text element, and that the italic text does not also occur in the plain-text portion (str_replace removes the first match, wherever it is).
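If those assumptions are a concern, one possible variant (a sketch, not part of the answer above) is to read only the direct text children of each element, so no string replacement is needed, and to collect every italic child. The "; " separator used to join multiple italic blocks is an arbitrary choice made here for illustration.

```r
library(xml2)

page <- read_xml("<question>
    <text>Just text</text>
    <text>Text with reaction<italic>Reaction</italic></text>
</question>")

nodes <- xml_find_all(page, ".//text")

# Keep only the direct text children, so italic content is never included
plain <- vapply(nodes, function(n) {
  paste(xml_text(xml_find_all(n, "./text()")), collapse = "")
}, character(1))

# Collect all italic children; NA when there are none
italic <- vapply(nodes, function(n) {
  hits <- xml_text(xml_find_all(n, "./italic"))
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = "; ")
}, character(1))

answer <- data.frame(text = plain, italic = italic, stringsAsFactors = FALSE)
answer
```

Working per node keeps the text and italic vectors aligned row by row, even when a text element has no italic child or has several.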
