If I query my target link as follows:
library(jsonlite)
link <- "https://www.forest-trends.org/wp-content/themes/foresttrends/map_tools/project_fetch_single.php?pid=1"
df <- fromJSON(link)
I get a JSON list with one element: df$html. I would like to parse this HTML using rvest in order to access tags like psize and pstatus. But the double backslashes \\ seem to stop me. Any idea how to formulate my rvest query correctly? I'm thinking of something like:
df$html %>% html_node(xpath = '//div[contains(@class, \"psize\")]') %>% html_text()
CodePudding user response:
Combining a few different functions gets you there. This is not supposed to be a 100% correct answer, but it may give you some ideas about how to format the string.
library(rvest)
library(tidyr)
split <- read_html(link) %>%
html_node(xpath='/html/body/div') %>%
html_text() %>%
strsplit(., split = "\\\\n|\\\\t")
split <- split[[1]][!is.na(split[[1]]) & split[[1]] != ""]
data.frame(col1 = split[1:5]) %>%
separate(col = col1, into = c("col1", "col2"), sep = ": ", extra = "drop")
  col1          col2
1 Size          85000 ha
2 Status        In development
3 Description   REDD project in Madre de Dios, Peru
4 Objective     Carbon sequestration or avoided, Carbon sequestration or avoided
5 Interventions Afforestation or reforestation
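Another route that may be closer to the original idea: fromJSON() already decodes the JSON escapes, so df$html should be an ordinary HTML string that read_html() can take directly, and the backslashes are probably only escaping in the raw response or in the printed output. Below is a minimal sketch of that approach. The json_txt string is a made-up stand-in for the server response, and the assumption that psize and pstatus are class names on div elements comes from the XPath in the question, so the real markup may differ.

```r
library(jsonlite)
library(rvest)

# Hypothetical stand-in for the server response (not the real payload).
# Inside the JSON text, quotes, slashes, newlines and tabs are escaped.
json_txt <- '{"html": "<div class=\\"psize\\">Size: 85000 ha<\\/div>\\n\\t<div class=\\"pstatus\\">Status: In development<\\/div>"}'

# fromJSON() decodes the escapes, leaving plain HTML in df$html.
df <- fromJSON(json_txt)

# Parse the HTML string itself instead of piping the string into html_node().
page <- read_html(df$html)

# XPath as in the question (assumes psize is a class on a div).
size <- page %>%
  html_node(xpath = '//div[contains(@class, "psize")]') %>%
  html_text()

# The same idea with a CSS selector (assumes pstatus is a class on a div).
status <- page %>%
  html_node("div.pstatus") %>%
  html_text()
```

With the real endpoint, replacing json_txt with the link from the question should work the same way, provided the class names match.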