Scrape object from HTML with R

Time:12-12

I am new to web scraping with R and I am trying to get a daily updated object which is probably not text. The URL is here and I want to extract the daily situation table at the end of the page. The class of this object is

I am not really experienced with HTML and CSS, so if you have any useful sources or advice on how I can extract objects from a webpage I would really appreciate it, since SelectorGadget in this case indicates "No valid path found."

CodePudding user response:

Without getting into the business of writing web scrapers, I think this should help you out:

library(rvest)
url <- 'https://covid19.public.lu/en.html'
page <- read_html(url)
# Take each stat container, then the text of the '.number' element inside it
selection <- html_nodes(page, '.cmp-gridStat__item-container') %>% html_node('.number') %>% html_text() %>% toString()
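
To see what that selector chain returns without depending on the live page, here is a minimal sketch run against an inline HTML string. The markup and the numbers are made up for illustration and only imitate the class names used above; the real page may be structured differently.

```r
library(rvest)

# Hypothetical markup: two stat containers, each holding one '.number' element.
snippet <- '<div class="cmp-gridStat__item-container"><span class="number">12</span></div>
<div class="cmp-gridStat__item-container"><span class="number">345</span></div>'

page <- read_html(snippet)

# One container per statistic, then the first '.number' node inside each one.
containers <- html_nodes(page, ".cmp-gridStat__item-container")
numbers <- html_text(html_node(containers, ".number"))

numbers            # character vector, one entry per container
toString(numbers)  # the same values collapsed into one comma-separated string
```

Keeping the values as a vector rather than calling toString() right away is probably more convenient if you want to convert them to numbers later.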

CodePudding user response:

There is probably a much more elegant way to do this efficiently, but when I need to brute-force something like this, I try to break it down into small parts.

  1. Use the httr library to get the raw html.
  2. Use str_extract from the stringr library to extract the specific piece of data from the html.
  3. I use both a positive lookbehind and a lookahead in the regex to get the exact piece of data I need. It basically takes the form (?<=text_right_before).*?(?=text_right_after)

library(httr)
library(stringr)

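# Fetch the page and pull the body out as a single string
r <- GET("https://covid19.public.lu/en.html")
html <- content(r, "text")

# Lazily match whatever sits between the label and the tag that closes it
normal_care <- str_extract(html, regex("(?<=Normal care: ).*?(?=<br>)"))
intensive_care <- str_extract(html, regex("(?<=Intensive care: ).*?(?=</p>)"))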
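
The lookbehind/lookahead idea is easier to check on a fixed string than on the live page. The sketch below assumes the relevant fragment looks roughly like the sample; the fragment and its figures are invented for illustration.

```r
library(stringr)

# Hypothetical fragment imitating the labels and tags targeted above.
html <- "<p>Normal care: 25<br>Intensive care: 7</p>"

# (?<=...) requires the label just before the match, (?=...) requires the tag
# just after it, and .*? lazily captures whatever sits in between.
normal_care <- str_extract(html, "(?<=Normal care: ).*?(?=<br>)")
intensive_care <- str_extract(html, "(?<=Intensive care: ).*?(?=</p>)")

# str_extract returns character strings, so convert before doing arithmetic.
counts <- as.integer(c(normal_care, intensive_care))
counts
```

If the label or the closing tag changes on the page, str_extract returns NA, so it is worth checking for NA before using the values.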