I have a variable called url3 in my R script:
url3 <- read_html("https://www.booking.com/hotel/mu/legends.en-gb.html?aid=356980&label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ&sid=729aafddc363c28a2c2c7379d7685d87&all_sr_blocks=36363601_246990918_2_85_0&checkin=2022-12-08&checkout=2022-12-15&dest_id=-1354779&dest_type=city&dist=0&from_beach_key_ufi_sr=1&group_adults=2&group_children=0&hapos=1&highlighted_blocks=36363601_246990918_2_85_0&hp_group_set=0&hpos=1&no_rooms=1&sb_price_type=total&sr_order=popularity&sr_pri_blocks=36363601_246990918_2_85_0__29200&srepoch=1619681695&srpvid=51c8354f03be0097&type=total&ucfs=1&req_children=0&req_adults=2&hp_refreshed_with_new_dates=1")
I need to extract 2 pieces of information from that URL and assign them to 2 other variables: checkin date and checkout date.
In the URL, they are located after the words "checkin=" and "checkout=". From the above example, they are stated as: checkin=2022-12-08&checkout=2022-12-15
I would like to assign those dates as follows:
checkin <- "2022-12-08"
checkout <- "2022-12-15"
I need the dates to be exactly as above; that is, in the YYYY-MM-DD format.
How can I perform this operation in R?
CodePudding user response:
Here's another way, using str_extract_all from the stringr package.
library(stringr)
# Assumes url3 holds the URL as a character string; the first match is checkin, the second checkout.
dates <- str_extract_all(url3, "\\d{4}-\\d{2}-\\d{2}")[[1]]
checkin <- as.Date(dates[1], "%Y-%m-%d")
checkout <- as.Date(dates[2], "%Y-%m-%d")
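Since the question asks for the dates in exactly the YYYY-MM-DD format, it may help to see what as.Date actually returns. The sketch below uses the two dates from the question as literal strings; the names checkin_chr and nights are only illustrative and do not appear in the original thread.

```r
# Parse the two date strings from the question into Date objects.
checkin  <- as.Date("2022-12-08", format = "%Y-%m-%d")
checkout <- as.Date("2022-12-15", format = "%Y-%m-%d")

# A Date is stored as a number of days but prints in YYYY-MM-DD form.
class(checkin)

# format() turns it back into a plain character string if one is needed.
checkin_chr <- format(checkin, "%Y-%m-%d")

# Keeping Date objects makes date arithmetic possible, e.g. the length of the stay.
nights <- as.numeric(difftime(checkout, checkin, units = "days"))
```

If only the text is needed, the as.Date step can be skipped and the extracted strings used directly.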
CodePudding user response:
One way to do it is to read the dates from the page itself (here url3 is the parsed page returned by read_html):
library(rvest)
df = url3 %>% html_nodes('.sb-date-field__display')
checkin = df[html_attr(df, "data-placeholder")=="Check-in date"] %>% html_text2()
checkin
#> [1] "Thursday 8 December 2022"
checkout = df[html_attr(df, "data-placeholder")=="Check-out date"] %>% html_text2()
checkout
#> [1] "Thursday 15 December 2022"
CodePudding user response:
Yet another solution, based on stringr::str_extract and lookaround:
library(stringr)
checkin <- str_extract(url3, "(?<=checkin\\=)\\d{4}-\\d{2}-\\d{2}")
checkout <- str_extract(url3, "(?<=checkout\\=)\\d{4}-\\d{2}-\\d{2}")
checkin
#> [1] "2022-12-08"
checkout
#> [1] "2022-12-15"
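If adding stringr is not an option, the same values can be read with base R by splitting the query string into name/value pairs. This is only a sketch: the url below is a shortened, hypothetical version of the one in the question, the variable names are illustrative, and the code does no percent-decoding of the values.

```r
# Shortened stand-in for the URL in the question (the real one has many more parameters).
url <- "https://www.booking.com/hotel/mu/legends.en-gb.html?aid=356980&checkin=2022-12-08&checkout=2022-12-15&group_adults=2"

# Drop everything up to and including "?", then split into "key=value" pairs.
query <- sub("^[^?]*\\?", "", url)
pairs <- strsplit(query, "&", fixed = TRUE)[[1]]

# Split each pair at its first "=" into a name and a value.
keys   <- sub("=.*$", "", pairs)
values <- sub("^[^=]*=", "", pairs)
params <- setNames(values, keys)

# Look the dates up by parameter name instead of by position in the URL.
checkin  <- params[["checkin"]]
checkout <- params[["checkout"]]
```

Looking parameters up by name keeps working if the order of the query parameters changes or if another date-like value appears earlier in the URL.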