I am new to web scraping with R and I am trying to get a daily updated object which is probably not text. The URL is here and I want to extract the daily situation table at the end of the page. The class of this object is
I am not really experienced with HTML and CSS, so if you have any useful source or advice on how I can extract objects from a webpage I would really appreciate it, since SelectorGadget reports "No valid path found." in this case.
CodePudding user response:
Without getting into the business of writing web scrapers, I think this should help you out:
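library(rvest)
# Download and parse the page
url <- 'https://covid19.public.lu/en.html'
page <- read_html(url)
# Take the first '.number' inside each stat container and join the values into one string
selection <- html_nodes(page, '.cmp-gridStat__item-container') %>% html_node('.number') %>% html_text() %>% toString()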
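If you want to see what that selector chain does without hitting the live page, here is a minimal sketch on an inline fragment. The markup is made up for illustration: only the class names '.cmp-gridStat__item-container' and '.number' come from the code above, the '.label' class is hypothetical, and the real page may be structured differently.

```r
library(rvest)

# Hypothetical fragment; the '.label' spans are an assumption, not taken from the real page.
fragment <- '
<div class="cmp-gridStat__item-container">
  <span class="number">12</span><span class="label">Normal care</span>
</div>
<div class="cmp-gridStat__item-container">
  <span class="number">3</span><span class="label">Intensive care</span>
</div>'

page <- read_html(fragment)
items <- html_nodes(page, ".cmp-gridStat__item-container")

# html_node() returns the first match inside each container,
# so each result has exactly one entry per container.
numbers <- html_text(html_node(items, ".number"), trim = TRUE)
labels <- html_text(html_node(items, ".label"), trim = TRUE)

# Keep the values as a data frame instead of one comma-separated string.
stats <- data.frame(label = labels, value = as.numeric(numbers))
print(stats)
```

Keeping one row per container is usually easier to work with afterwards than the single string that toString() produces.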
CodePudding user response:
There is probably a much more elegant and efficient way to do this, but when I need to brute force something like this, I try to break it down into small parts.
- Use the httr library to get the raw html.
- Use str_extract from the stringr library to extract the specific piece of data from the html.
- I use both a positive lookbehind and a positive lookahead in the regex to get the exact piece of data I need. It basically takes the form "(?<=text_right_before).*?(?=text_right_after)".
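library(httr)
library(stringr)
# Fetch the page and keep the body as a single string
r <- GET("https://covid19.public.lu/en.html")
html <- content(r, "text")
# Lazily capture whatever sits between each label and the tag that follows it
normal_care <- str_extract(html, regex("(?<=Normal care: ).*?(?=<br>)"))
intensive_care <- str_extract(html, regex("(?<=Intensive care: ).*?(?=</p>)"))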
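To check the lookbehind/lookahead idea offline, here is a small sketch that runs the same kind of pattern on a hard-coded string. The snippet of HTML is an assumption standing in for the downloaded page source; the real markup around "Normal care:" and "Intensive care:" may differ.

```r
library(stringr)

# Hypothetical snippet standing in for the downloaded page source.
html <- "<p>Normal care: 12<br>Intensive care: 3</p>"

# (?<=...) starts the match right after the label, .*? lazily takes as little
# as possible, and (?=...) stops the match right before the closing text.
normal_care <- str_extract(html, "(?<=Normal care: ).*?(?=<br>)")
intensive_care <- str_extract(html, "(?<=Intensive care: ).*?(?=</p>)")

# str_extract() returns character strings, so convert before doing arithmetic.
counts <- as.integer(c(normal_care, intensive_care))
print(counts)
```

If a label is missing from the page, str_extract() returns NA rather than raising an error, so it is worth checking for NA before using the values.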