Extract info from a website with collapsible content using rvest

Website https://www.moe.gov.sg/schoolfinder/schooldetail?schoolname=ZHONGHUA-SECONDARY-SCHOOL

I only want to extract the information under "DSA talent areas offered in 2021".

However, when I use SelectorGadget, I get the path .is--open:nth-child(4) .moe-collapsible__content

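listpage <- read_html("https://www.moe.gov.sg/schoolfinder/schooldetail?schoolname=ZHONGHUA-SECONDARY-SCHOOL")  # assumed; the post does not show how listpage was created
dsa <- html_node(listpage, ".is--open:nth-child(4) .moe-collapsible__content") %>% html_text()
dsa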

The output is NA.

Is there any way to get information from the collapsible content?
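
A likely cause, offered as an assumption rather than something confirmed in this thread: SelectorGadget inspects the page as rendered in the browser, where a script may add the is--open class once a section is expanded, while read_html() only sees the HTML the server sends. The sketch below uses a made-up snippet (not the real MOE markup) to show how a selector that depends on such a class comes back as NA, while one that uses only classes present in the static markup still matches:

```r
library(rvest)

# Hypothetical stand-in for the static HTML: the section is collapsed,
# so nothing carries an "is--open" class.
page <- read_html('
  <div class="moe-collapsible">
    <div class="moe-collapsible__content">
      <ul class="moe-list"><li>Choir</li><li>Badminton</li></ul>
    </div>
  </div>')

# Selector that depends on the (assumed) script-added class: no match
missing_text <- page %>%
  html_element(".is--open .moe-collapsible__content") %>%
  html_text()

# Selector that relies only on classes present in the static markup
found_items <- page %>%
  html_elements(".moe-collapsible__content li") %>%
  html_text()

missing_text
found_items
```

If the real page behaves this way, dropping .is--open from the selector is the key change; the answers below both do that.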

CodePudding user response：

One way to do it is:

library(rvest)
library(dplyr)
library(stringr)

'https://www.moe.gov.sg/schoolfinder/schooldetail?schoolname=ZHONGHUA-SECONDARY-SCHOOL' %>%
  read_html() %>%
  html_nodes('.moe-collapsible__content') %>%
  html_nodes('.moe-list') %>%
  html_text() %>%
  nth(3) %>%          # third .moe-list on the page: the DSA talent areas
  str_split('\n')
[[1]]
 [1] "Leadership and Character (Girls and Boys)\r"                                 
 [2] "                                        Chinese Orchestra (Girls and Boys)\r"
 [3] "                                        Choir (Girls and Boys)\r"            
 [4] "                                        Concert Band (Girls and Boys)\r"     
 [5] "                                        Guzheng Ensemble (Girls and Boys)\r" 
 [6] "                                        Badminton (Girls)\r"                 
 [7] "                                        Basketball (Girls)\r"                
 [8] "                                        Table Tennis (Boys)\r"               
 [9] "                                        Volleyball (Boys)\r"
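
The strings above still carry leading spaces and a trailing \r. As a small follow-up sketch (not part of the original answer), base R's trimws() can clean them up; the raw vector below is just three of the values copied from the output:

```r
# A few of the raw strings from the output above
raw <- c(
  "Leadership and Character (Girls and Boys)\r",
  "                                        Chinese Orchestra (Girls and Boys)\r",
  "                                        Volleyball (Boys)\r"
)

# trimws() removes leading and trailing spaces, tabs, \r and \n by default
clean <- trimws(raw)
clean
```

In the pipeline itself, the same call could be applied to the first element of the list returned by str_split().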

CodePudding user response：

You can be more precise by combining the class selector with :contains to target the correct parent div, then using a descendant selector to reach the child li elements. Matching on a partial string ("DSA talent areas" rather than the full heading with the year) may also give you some future-proofing for 2022.

library(magrittr)
library(rvest)

read_html("https://www.moe.gov.sg/schoolfinder/schooldetail?schoolname=ZHONGHUA-SECONDARY-SCHOOL") %>%
  html_elements('.moe-collapsible:contains("DSA talent areas") li') %>% html_text()
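
To see how the :contains filter picks out one section among several with the same class, here is an offline sketch. The HTML is invented for illustration (the button text and the "Chess Club" item are assumptions, not the real page content):

```r
library(rvest)

# Hypothetical, simplified markup: two collapsible sections sharing one class
page <- read_html('
  <div class="moe-collapsible">
    <button>Co-curricular activities</button>
    <div class="moe-collapsible__content"><ul><li>Chess Club</li></ul></div>
  </div>
  <div class="moe-collapsible">
    <button>DSA talent areas offered in 2021</button>
    <div class="moe-collapsible__content"><ul><li>Choir</li><li>Badminton</li></ul></div>
  </div>')

# Keep only the section whose text contains the partial heading,
# then take its li descendants
dsa_items <- page %>%
  html_elements('.moe-collapsible:contains("DSA talent areas") li') %>%
  html_text()

dsa_items
```

The match is case-sensitive and looks at all text inside the div, so a short, distinctive fragment of the heading is the safer choice.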