Extract value in HTML based on field

I have this HTML example:

<d>
    <d>
        <t>0</t>
        <p>1. Question 1</p>
        <d>12111</d>
        <r>
            <o>A. aaa</o>
            <o>B. Sol B</o>
            <o>C. ccc</o>
            <o>D. ddd</o>
            <o>E. eee</o>
        </r>
    </d>
    <d>
        <t>0</t>
        <p>2. Question 2</p>
        <d>11112</d>
        <r>
            <o>A. aaa</o>
            <o>B. bbb</o>
            <o>C. ccc</o>
            <o>D. ddd</o>
            <o>E. Sol E</o>
        </r>
    </d>
    <d>
        <t>0</t>
        <p>3. Question 3</p>
        <d>21111</d>
        <r>
            <o>A. Sol A</o>
            <o>B. bbb</o>
            <o>C. ccc</o>
            <o>D. ddd</o>
            <o>E. eee</o>
        </r>
    </d>
</d>

I want to parse it to obtain a table with two columns: question and answer.

The question is in the p tag: <p>1. Question 1</p>.

The answer is given by the position of the digit 2 in the key: <d>12111</d>. For question 1 the 2 is in the second position, so the answer is the second <o> tag: "B. Sol B".
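As a quick illustration of that rule, here is a minimal base R sketch for question 1 only. The variable names `key`, `options` and `pos` are made up for this example, and the values are copied by hand from the HTML above rather than parsed from it.

```r
# Key string and option texts copied by hand from question 1 above
key <- "12111"
options <- c("A. aaa", "B. Sol B", "C. ccc", "D. ddd", "E. eee")

# regexpr() gives the position of the first "2" in the key
pos <- as.integer(regexpr("2", key, fixed = TRUE))

# Pick the option at that position
options[pos]
# [1] "B. Sol B"
```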

The output should be:

| Questions | Answers |
| -------- | -------------- |
| 1. Question 1 | B. Sol B |
| 2. Question 2 | E. Sol E |
| 3. Question 3 | A. Sol A |

This is what I have tried, but it does not work very well:

library(dplyr)
library(stringr)
library(rvest)

pg = read_html('
<d>
    <d>
        <t>0</t>
        <p>1. Question 1</p>
        <d>12111</d>
        <r>
            <o>A. aaa</o>
            <o>B. Sol B</o>
            <o>C. ccc</o>
            <o>D. ddd</o>
            <o>E. eee</o>
        </r>
    </d>
    <d>
        <t>0</t>
        <p>2. Question 2</p>
        <d>11112</d>
        <r>
            <o>A. aaa</o>
            <o>B. bbb</o>
            <o>C. ccc</o>
            <o>D. ddd</o>
            <o>E. Sol E</o>
        </r>
    </d>
    <d>
        <t>0</t>
        <p>3. Question 3</p>
        <d>21111</d>
        <r>
            <o>A. Sol A</o>
            <o>B. bbb</o>
            <o>C. ccc</o>
            <o>D. ddd</o>
            <o>E. eee</o>
        </r>
    </d>
</d>', encoding="UTF-8")

pg2 <- pg %>% html_nodes('d') %>% html_elements('d')

long <- length(pg2)
long_loop <- pg2 %>% html_elements('d') %>% length()

df <- data.frame('questions' = character(long), 'answers' = character(long), stringsAsFactors = FALSE)


for( i in 1:long_loop) 
{
  if(i %% 2 == 1)
  {
    pg3 <- pg2 %>% `[[`(1)
    
    pg_question <- pg3 %>% html_elements('p') %>% html_text2()
    pg_soltxt <- pg3 %>% html_elements('d') %>% html_text2()
    pg_solpos <- unlist(gregexpr('2', pg_soltxt))[1]
    pg_answer <- pg3 %>% html_element('r') %>% html_elements("o") %>% html_text2() %>% `[[`(pg_solpos)
    
    
    df[i,1] <- pg_question
    df[i,2] <- pg_answer
  }
}

There is probably a better way to do it without the loop, using rvest.

CodePudding user response：

With rvest plus tidyr, you could try:

library(tidyr)
library(dplyr)
library(rvest)

pg |> 
  html_elements(xpath = ".//d/p|//d/d/d|.//d/r/o") |>
  html_text() |> 
  tibble(text = _ ) |> 
  mutate(questions   = if_else(grepl("Question", text) == TRUE,  text, NA_character_),
         indx_answer = if_else(grepl("^\\d+$", text)   == TRUE,  text, NA_character_),
         answers     = if_else(grepl("Question", text) == FALSE, text, NA_character_)) |> 
  tidyr::fill(questions, indx_answer,  .direction = "down") |>
  filter(!is.na(answers), answers != indx_answer) |>    
  group_by(questions) |> 
  mutate(group_index = row_number(),
        indx_answer = substr(indx_answer, start = group_index, stop = group_index)) |> 
  filter(indx_answer == 2) |> ungroup() |>
  select(questions, answers)

# A tibble: 3 × 2
  questions     answers 
  <chr>         <chr>   
1 1. Question 1 B. Sol B
2 2. Question 2 E. Sol E
3 3. Question 3 A. Sol A

CodePudding user response：

You could simply build a logical mask from the answer key. Once the key strings are split into single characters, the vector of 1s and 2s has the same length and order as the vector of answer options, so you can keep only the options at the positions where the key equals 2.

library(rvest)
library(dplyr)
library(stringr)

html <- 'your html goes here'

page <- read_html(html)

key <- page |>
  html_elements("p + d") |>
  html_text() |>
  str_split("") |>
  unlist()

answers <- page |>
  html_elements("o") |>
  html_text()

questions <- page |>
  html_elements("p") |>
  html_text()

t <- tibble(
  question = questions,
  answer = answers[key == 2]
)
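To see why the mask lines up, here is a toy version in base R with hand-written vectors instead of parsed HTML. The data and the names `keys`, `options` and `mask` are hypothetical and only meant to show the idea. It assumes every key string has exactly one character per option.

```r
# Hypothetical toy data: two questions with three options each
keys    <- c("121", "112")
options <- c("A1", "B1", "C1", "A2", "B2", "C2")

# Split each key into single characters and flatten into one vector
mask <- unlist(strsplit(keys, "")) == "2"

# The mask is only meaningful if it lines up one-to-one with the options
stopifnot(length(mask) == length(options))

# Keep the options where the key is "2"
options[mask]
# [1] "B1" "C2"
```

Note that this approach relies on each question having exactly one 2 in its key. Otherwise the filtered answers would no longer match the questions in length, and building the tibble would fail.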