How to scrape data from this specific webpage and save the output in a data frame?

I am new to web scraping with R and need help with this task. I am trying to scrape data from a specific webpage and am stuck at one point in the process.

Here is the URL: webpage

Basically, I am trying to capture 3 elements from the webpage:

(1) Room Type (CSS selector: .room h3)
(2) Meal Plan (CSS selector: .meal-plan-title)
(3) Price (CSS selector: .price)

I have been able to extract those values from the webpage. However, I am having a hard time matching them up the way they are displayed on the webpage.

Here is my R code so far:

library(rvest)
library(dplyr)
library(stringr)
library(tables)

MealPlan <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=") %>%
  # html_nodes(".meal-plan-text") %>%
  html_nodes(".meal-plan-title") %>%
  html_text()

MealPlan

Price <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=") %>%
  html_nodes(".price") %>%
  html_text()

Price


RoomType <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=") %>%
  html_nodes(".room h3") %>%
  html_text()

RoomType

I would like to have the output in a data frame as follows:

RoomType             MealPlan            Price
Chambre Standard     Petit Dej. Diner    584 € / pers
Chambre Standard     All inclusive       864 € / pers
Chambre Confort      Petit Dej. Diner    715 € / pers
Chambre Confort      All inclusive       995 € / pers
Bungalow             Petit Dej. Diner    781 € / pers
Bungalow             All inclusive       1061 € / pers
Chambre Deluxe       Petit Dej. Diner    847 € / pers
Chambre Deluxe       All inclusive       1127 € / pers

Any help would be highly appreciated.

CodePudding user response：

You could use map_dfr from purrr to generate a wide data frame with a separate column for each meal plan, then pivot_longer from tidyr to gather those columns into one, with the price info as the values. The initial list you pass into map_dfr would be the parent elements representing each room listing, gathered with the CSS selector .room.

library(rvest)
library(purrr)
library(dplyr)
library(tidyr) # provides pivot_longer()

page <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=")

# One row per room, with one column per meal plan (wide format)
df <- map_dfr(page |> html_elements(".room"), ~
data.frame(
  RoomType = .x |> html_element("h3") |> html_text(),
  `Petit Dej. Diner` = .x |> html_element(".price") |> html_text() |> trimws(),
  `All inclusive` = .x |> html_element("div:nth-child(5) .price") |> html_text() |> trimws(),
  check.names = FALSE # keep the meal plan names exactly as written
)) |>
  # Reshape to one row per room and meal plan
  pivot_longer(!RoomType, names_to = "MealPlan", values_to = "Price")
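
If you want to try the per-room idea without hitting the live site, here is a minimal sketch on made-up HTML. The markup is an assumption and not the site's real structure; only the selectors .room, h3, .meal-plan-title and .price come from the question. It also assumes every room has exactly one .price per .meal-plan-title, which would not hold for unpriced plans. This variant reads the titles and prices inside each room directly, so no pivot is needed.

```r
library(rvest)
library(purrr)

# Made-up markup standing in for the real page (an assumption)
page <- minimal_html('
  <div class="room">
    <h3>Chambre Standard</h3>
    <div><span class="meal-plan-title">Petit Dej. Diner</span> <span class="price">584 € / pers</span></div>
    <div><span class="meal-plan-title">All inclusive</span> <span class="price">864 € / pers</span></div>
  </div>
  <div class="room">
    <h3>Bungalow</h3>
    <div><span class="meal-plan-title">Petit Dej. Diner</span> <span class="price">781 € / pers</span></div>
    <div><span class="meal-plan-title">All inclusive</span> <span class="price">1 061 € / pers</span></div>
  </div>
')

# Iterate over the parent .room elements so titles and prices stay paired
df <- map_dfr(html_elements(page, ".room"), function(room) {
  data.frame(
    # a single value per room; data.frame() recycles it across the rows
    RoomType = room |> html_element("h3") |> html_text(trim = TRUE),
    MealPlan = room |> html_elements(".meal-plan-title") |> html_text(trim = TRUE),
    Price    = room |> html_elements(".price") |> html_text(trim = TRUE)
  )
})

df
```

Because each data frame is built inside one room, a price can never be matched to a title from a different room, which is the mismatch the question describes.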

CodePudding user response：

Here is a slower approach. I've added the argument trim = TRUE to html_text() to remove extra whitespace.

One issue with MealPlan is that a few of the meal plans have the class .noprice. One way to exclude them is to use XPath in html_nodes() instead of CSS selectors; I don't know whether it can be done with CSS selectors alone. What I did below was extract both sets and then take their set difference.

For the price, I've used a regular expression to remove the extra space inside the number.
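
To see what that regular expression does in isolation, here is a small sketch on made-up price strings shaped like the ones in the question. The last step, pulling out a numeric amount, is an optional extra and not part of the original approach.

```r
library(stringr)

# Made-up price strings in the same shape as the page's prices
raw_prices <- c("584 € / pers", "1 061 € / pers")

# Remove a single whitespace character that sits between two digits;
# the space before the currency symbol is left alone
clean_prices <- str_replace(raw_prices, "(\\d)\\s(\\d)", "\\1\\2")

# Optional: extract the leading run of digits as a number for calculations
amounts <- as.numeric(str_extract(clean_prices, "\\d+"))

clean_prices
amounts
```

Note that str_replace() only replaces the first match in each string, which is enough as long as a price has at most one such space.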

library(rvest)
library(dplyr)
library(stringr)

url <- "https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]="

# Download and parse the page once, then reuse it for every selector
page <- read_html(url)

Price <- page %>%
  html_nodes(".price") %>%
  html_text(trim = TRUE) %>%
  # drop the space between digits, e.g. "1 061" becomes "1061"
  str_replace("(\\d)\\s(\\d)", "\\1\\2")

RoomType <- page %>%
  html_nodes(".room h3") %>%
  html_text(trim = TRUE) %>%
  # assumes each room is listed with two priced meal plans
  rep(each = 2)

AllMealPlans <- page %>%
  html_nodes(".meal-plan-text") %>%
  html_text(trim = TRUE)

MealPlansNoPrice <- page %>%
  html_nodes(".noprice .meal-plan-text") %>%
  html_text(trim = TRUE)

# Keep only the priced meal plans, then repeat them once per room
MealPlan <- setdiff(AllMealPlans, MealPlansNoPrice) %>%
  rep(times = length(unique(RoomType)))

bind_cols(RoomType = RoomType, MealPlan = MealPlan, Price = Price)
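
As a possible alternative to the set difference, the unpriced plans can be filtered out at the node level. Below is a minimal sketch on made-up HTML; the markup and the "plan" wrapper class are assumptions, and only .meal-plan-text and .noprice come from the real page. The XPath version excludes any .meal-plan-text that has a .noprice ancestor. The CSS version uses the :not() pseudo-class, but it only works if .noprice sits on the direct parent, which I have not checked against the real page.

```r
library(rvest)

# Made-up markup (an assumption): the middle plan is marked as unpriced
page <- minimal_html('
  <div class="room">
    <h3>Chambre Standard</h3>
    <div class="plan"><span class="meal-plan-text">Petit Dej. Diner</span></div>
    <div class="plan noprice"><span class="meal-plan-text">Pension complete</span></div>
    <div class="plan"><span class="meal-plan-text">All inclusive</span></div>
  </div>
')

# XPath: keep .meal-plan-text nodes that have no ancestor with class noprice
xp <- paste0(
  "//*[contains(concat(' ', normalize-space(@class), ' '), ' meal-plan-text ')]",
  "[not(ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' noprice ')])]"
)
priced_xpath <- page %>%
  html_elements(xpath = xp) %>%
  html_text(trim = TRUE)

# CSS: same idea, but only valid when .noprice is on the direct parent
priced_css <- page %>%
  html_elements(":not(.noprice) > .meal-plan-text") %>%
  html_text(trim = TRUE)

priced_xpath
priced_css
```

Filtering nodes rather than text values keeps duplicates and document order intact, whereas setdiff() works on the unique text values and would drop a plan everywhere if it were unpriced for even one room.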