I am new to web scraping with R and need help with this task. I am trying to scrape data from a specific webpage and I am stuck at one point in the process.
Here is the URL: webpage
Basically, I am trying to capture 3 elements from the webpage:
(1) Room Type (CSS selector: .room h3)
(2) Meal Plan (CSS selector: .meal-plan-title)
(3) Price (CSS selector: .price)
I have been able to extract those values from the webpage. However, I am having a hard time matching them up the way they are displayed on the webpage, with each room type next to its meal plans and prices.
Here is my R code so far:
library(rvest)
library(dplyr)
library(stringr)
library(tables)
MealPlan <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=") %>%
  #html_nodes(".meal-plan-text") %>%
  html_nodes(".meal-plan-title") %>%
  html_text()
MealPlan
Price <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=") %>%
  html_nodes(".price") %>%
  html_text()
Price
RoomType <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=") %>%
  html_nodes(".room h3") %>%
  html_text()
RoomType
I would like to have the output in a data frame as follows:
RoomType          MealPlan          Price
Chambre Standard  Petit Dej. Diner  584 € / pers
Chambre Standard  All inclusive     864 € / pers
Chambre Confort   Petit Dej. Diner  715 € / pers
Chambre Confort   All inclusive     995 € / pers
Bungalow          Petit Dej. Diner  781 € / pers
Bungalow          All inclusive     1061 € / pers
Chambre Deluxe    Petit Dej. Diner  847 € / pers
Chambre Deluxe    All inclusive     1127 € / pers
Any help would be highly appreciated.
CodePudding user response:
You could use map_dfr from purrr to generate a wide data frame with a separate column for each meal plan, then pivot_longer (from tidyr) to gather those columns into a single MealPlan column, with the prices as the values. The initial list you pass into map_dfr would be the parent elements representing each room listing, gathered with the CSS selector .room.
library(rvest)
library(purrr)
library(dplyr)
library(tidyr) # provides pivot_longer()
page <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=")
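# Build one wide row per room: the room name plus one price column per meal plan
df <- map_dfr(page |> html_elements(".room"), ~
  data.frame(
    RoomType = .x |> html_element("h3") |> html_text(),
    `Petit Dej. Diner` = .x |> html_element(".price") |> html_text() |> trimws(),
    `All inclusive` = .x |> html_element("div:nth-child(5) .price") |> html_text() |> trimws(),
    check.names = FALSE # keep the meal plan names as written instead of "Petit.Dej..Diner"
  )) |>
  # Reshape to one row per room and meal plan
  pivot_longer(!RoomType, names_to = "MealPlan", values_to = "Price")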
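Since the live page may change or be unreachable, the same wide-then-long idea can be tried offline. The following is only a sketch: the HTML fixture, its class names other than .room and .price, and the assumption that each room lists its two prices in a fixed order are all invented for illustration and may not match the real page.

```r
library(rvest)
library(purrr)
library(dplyr)
library(tidyr)

# Hypothetical markup, invented for this sketch; the real page may differ.
page <- minimal_html('
  <div class="room">
    <h3>Chambre Standard</h3>
    <div class="meal"><span class="price"> 584 € / pers </span></div>
    <div class="meal"><span class="price"> 864 € / pers </span></div>
  </div>
  <div class="room">
    <h3>Bungalow</h3>
    <div class="meal"><span class="price"> 781 € / pers </span></div>
    <div class="meal"><span class="price"> 1061 € / pers </span></div>
  </div>
')

# One parent node per room keeps each price attached to its own room.
rooms <- page |> html_elements(".room")

# Wide table: one row per room, one column per meal plan.
# Assumes the first price is "Petit Dej. Diner" and the second "All inclusive".
wide <- map_dfr(rooms, function(room) {
  prices <- room |> html_elements(".price") |> html_text(trim = TRUE)
  tibble(
    RoomType = room |> html_element("h3") |> html_text(trim = TRUE),
    `Petit Dej. Diner` = prices[1],
    `All inclusive` = prices[2]
  )
})

# Long table: one row per room and meal plan.
long <- wide |>
  pivot_longer(!RoomType, names_to = "MealPlan", values_to = "Price")

print(long)
```

Scraping inside each .room node, rather than across the whole page, is what keeps the three columns aligned: every value is looked up relative to its own room.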
CodePudding user response:
A slower approach to the answer. I've added the argument trim = TRUE to html_text to remove extra whitespace.
One issue with MealPlan is that a few entries have the class .noprice. One way to exclude them is to use XPath in html_nodes instead of CSS selectors. I don't know if there is a way to do it with CSS selectors. What I did below was extract both sets, then take the set difference.
For the price, I've used a regular expression to remove the extra space inside the number.
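The price clean-up can be checked in isolation before running the full script. This is a small sketch on made-up sample strings, assuming the extra space is a thousands separator between two digits; the variable names are only for illustration.

```r
library(stringr)

# Sample strings modelled on the prices in the question (assumed raw format).
raw <- c("584 € / pers", "1 061 € / pers")

# Remove a single whitespace character that sits between two digits.
# str_replace() only touches the first match, which is enough for one separator.
clean <- str_replace(raw, "(\\d)\\s(\\d)", "\\1\\2")

# Optionally pull out the numeric part for calculations.
amount <- as.numeric(str_extract(clean, "\\d+"))

print(clean)
print(amount)
```

The space before the euro sign is left alone because it is not followed by a digit.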
library(rvest)
library(dplyr)
library(stringr)
url <- "https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]="
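# Download and parse the page once, then reuse it for every selector
page <- read_html(url)
# Prices, with the whitespace between two digits removed
Price <- page %>%
  html_nodes(".price") %>%
  html_text(trim = TRUE) %>%
  str_replace("(\\d)\\s(\\d)", "\\1\\2")
# Each room name is repeated once per priced meal plan (two per room here)
RoomType <- page %>%
  html_nodes(".room h3") %>%
  html_text(trim = TRUE) %>%
  rep(each = 2)
AllMealPlans <- page %>%
  html_nodes(".meal-plan-text") %>%
  html_text(trim = TRUE)
MealPlansNoPrice <- page %>%
  html_nodes(".noprice .meal-plan-text") %>%
  html_text(trim = TRUE)
# setdiff() also drops duplicates, leaving the distinct priced meal plans, which are then repeated once per room
MealPlan <- setdiff(AllMealPlans, MealPlansNoPrice) %>% rep(times = length(unique(RoomType)))
bind_cols(RoomType = RoomType, MealPlan = MealPlan, Price = Price)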
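On the open question of excluding the .noprice entries with CSS alone: the :not() pseudo-class may be enough. The sketch below runs on an invented fixture and assumes that .noprice sits on a container element that also carries a class, here the hypothetical meal-plan; the real page's markup may be different, so the selector would need adjusting. An XPath version is shown next to it for comparison.

```r
library(rvest)

# Hypothetical markup, invented for this sketch; "meal-plan" is an assumed class.
page <- minimal_html('
  <div class="meal-plan"><span class="meal-plan-text">Petit Dej. Diner</span></div>
  <div class="meal-plan noprice"><span class="meal-plan-text">Demi-pension</span></div>
  <div class="meal-plan"><span class="meal-plan-text">All inclusive</span></div>
')

# CSS: keep only texts whose container does not have the noprice class.
priced_css <- page |>
  html_elements(".meal-plan:not(.noprice) .meal-plan-text") |>
  html_text(trim = TRUE)

# XPath: keep texts that have no ancestor with the noprice class.
priced_xpath <- page |>
  html_elements(xpath = paste0(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' meal-plan-text ')]",
    "[not(ancestor::*[contains(concat(' ', normalize-space(@class), ' '), ' noprice ')])]"
  )) |>
  html_text(trim = TRUE)

print(priced_css)
print(priced_xpath)
```

Either selector avoids the second request and the setdiff step, and it keeps repeated meal plan names in document order instead of collapsing them to unique values.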