I have an XML file with text. Sometimes the text contains a mention inside an italic tag. I would like to fill my dataframe with the italic text, if it appears.
<question>
<text>Just text</text>
<text>Text with reaction<italic>Reaction</italic></text>
</question>
And I would like a dataframe like this:
text                 italic
Just text            NA
Text with reaction   Reaction
Here is my code so far:
library(tidyverse)
library(XML)
question = xmlParse("
<question>
<text>Just text</text>
<text>Text with reaction<italic>Reaction</italic></text>
</question>")
xml_data_question <- xmlToList(question)
#Create dataframe (df) #
df_question <- data.frame(matrix(ncol = 2, nrow = 2))
x = c("text", "remarks")
colnames(df_question) <- x
#Fill dataframe
df_question$text[1] = xml_data_question[1]$text
# df_question$remarks[1] = ifelse(xml_data_question[1]$text$italic != NULL, xml_data_question[1]$text$italic, NA)
df_question$text[2] = xml_data_question[2]$text
df_question$remarks[2] = ifelse(xml_data_question[2]$text$italic != NULL, xml_data_question[2]$text$italic, NA)
Of course, df_question$remarks[1] = xml_data_question[1]$text$italic does not work, because the italic element does not exist for the first entry. I would like to know how to detect that case, by catching an error or otherwise, so that I can record it in my dataframe.
Answer:
I prefer the xml2 package over the XML package because I find its syntax easier to use.
Here is a non-compact answer, but it gets the job done. The comments in the code explain the process step by step.
library(xml2)
library(dplyr)
library(stringr)
page <- read_xml("<question>
<text>Just text</text>
<text>Text with reaction<italic>Reaction</italic></text>
</question>")
#find the lines of text
textchunks <- xml_find_all(page, ".//text")
#get the plain text
text <- textchunks %>% xml_text()
#get the plain text of the italic portions
italic <- textchunks %>% xml_find_first( ".//italic") %>% xml_text()
answer <- data.frame(text, italic)
#remove the italic portion from the full text
answer$text <- if_else(is.na(answer$italic), answer$text, stringr::str_replace(answer$text, answer$italic, ""))
answer
# text italic
# 1 Just text <NA>
# 2 Text with reaction Reaction
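This assumes at most one italic block per text element, and that the italic string does not also occur in the plain-text portion (str_replace removes the first match). Note also that str_replace interprets the italic string as a regular expression.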
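If those assumptions are a concern, one possible variation is to skip the string replacement and ask XPath for the text that sits directly inside each text element. The following is only a sketch on the same sample data: it uses the XPath node test text() with xml_find_first, so it keeps only the first direct text node of each element, and any text that follows the italic element would be dropped.

```r
library(xml2)

page <- read_xml("<question>
<text>Just text</text>
<text>Text with reaction<italic>Reaction</italic></text>
</question>")

# one node per <text> element
textchunks <- xml_find_all(page, ".//text")

# first text node that is a direct child of each <text> element;
# text nested inside <italic> is not included
text <- xml_text(xml_find_first(textchunks, "./text()"))

# first <italic> child of each <text> element; NA where there is none
italic <- xml_text(xml_find_first(textchunks, "./italic"))

answer <- data.frame(text, italic)
answer
```

Because xml_find_first returns one result per input node (a missing node where nothing matches), both vectors have the same length as textchunks and line up row by row in the dataframe.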