How can I extract this specific information from a URL and assign it to a variable?

Time:03-19

I have a variable called url3 in my R Script.

url3 <- read_html("https://www.booking.com/hotel/mu/legends.en-gb.html?aid=356980&label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ&sid=729aafddc363c28a2c2c7379d7685d87&all_sr_blocks=36363601_246990918_2_85_0&checkin=2022-12-08&checkout=2022-12-15&dest_id=-1354779&dest_type=city&dist=0&from_beach_key_ufi_sr=1&group_adults=2&group_children=0&hapos=1&highlighted_blocks=36363601_246990918_2_85_0&hp_group_set=0&hpos=1&no_rooms=1&sb_price_type=total&sr_order=popularity&sr_pri_blocks=36363601_246990918_2_85_0__29200&srepoch=1619681695&srpvid=51c8354f03be0097&type=total&ucfs=1&req_children=0&req_adults=2&hp_refreshed_with_new_dates=1")

I need to extract 2 pieces of information from that URL and assign them to 2 other variables: checkin date and checkout date.

In the URL, they are located after the words "checkin=" and "checkout=". From the above example, they are stated as: checkin=2022-12-08&checkout=2022-12-15

I would like to assign those dates as follows:

checkin <- "2022-12-08"
checkout <- "2022-12-15"

I need the dates to be exactly as above; that is, in the YYYY-MM-DD format.

How can I perform this operation in R?

CodePudding user response:

Here's another way using str_extract_all from package stringr.

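library(stringr)
# Assumes url3 holds the URL as a character string, not the result of read_html().
# Extract every YYYY-MM-DD match once: the first is check-in, the second is checkout.
dates <- str_extract_all(url3, "\\d{4}-\\d{2}-\\d{2}")[[1]]
checkin <- as.Date(dates[1], "%Y-%m-%d")
checkout <- as.Date(dates[2], "%Y-%m-%d")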
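
Note that as.Date() returns Date objects rather than character strings. The following is a minimal base R sketch of the difference, using the two dates from the question as literal input; the names checkin_text and nights are only illustrative and do not come from the answer above.

```r
# Dates taken from the question; as.Date() parses ISO "YYYY-MM-DD" text by default.
checkin  <- as.Date("2022-12-08")
checkout <- as.Date("2022-12-15")

# A Date prints as YYYY-MM-DD, but it is stored as a number of days, not as text.
class(checkin)

# format() returns the plain character string, if text is what is needed.
checkin_text <- format(checkin, "%Y-%m-%d")

# Keeping the Date class allows arithmetic, e.g. the length of the stay in days.
nights <- as.integer(checkout - checkin)
```

Whether to keep the Date class or the string depends on what the values are used for afterwards.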

CodePudding user response:

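One way to do it is to read the dates from the parsed page itself rather than from the URL:

library(rvest)

# url3 is the document returned by read_html()
df <- url3 %>% html_nodes('.sb-date-field__display')

checkin <- df[html_attr(df, "data-placeholder") == "Check-in date"] %>% html_text2()
#> [1] "Thursday 8 December 2022"

checkout <- df[html_attr(df, "data-placeholder") == "Check-out date"] %>% html_text2()
#> [1] "Thursday 15 December 2022"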

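The text returned by this approach is not in the YYYY-MM-DD format the question asks for. As a rough sketch, assuming the text always has the shape "weekday day month year" with an English month name, it could be converted in base R as below. The helper to_iso_date is hypothetical and not part of rvest; it uses the built-in month.name constant so the result does not depend on the session locale.

```r
# Hypothetical helper: turn text such as "Thursday 8 December 2022" into "2022-12-08".
# Assumes exactly four whitespace-separated parts: weekday, day, month name, year.
to_iso_date <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")[[1]]
  day   <- as.integer(parts[2])
  month <- match(parts[3], month.name)   # English month names built into R
  year  <- as.integer(parts[4])
  sprintf("%04d-%02d-%02d", year, month, day)
}

checkin  <- to_iso_date("Thursday 8 December 2022")
checkout <- to_iso_date("Thursday 15 December 2022")

checkin
#> [1] "2022-12-08"
```
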
CodePudding user response:

Yet another solution, based on stringr::str_extract and lookaround:

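library(stringr)

# Assumes url3 holds the URL as a character string.
# The lookbehind, e.g. (?<=checkin=), ties each match to its parameter name.
checkin <- str_extract(url3, "(?<=checkin=)\\d{4}-\\d{2}-\\d{2}")
checkout <- str_extract(url3, "(?<=checkout=)\\d{4}-\\d{2}-\\d{2}")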

checkin
#> [1] "2022-12-08"

checkout
#> [1] "2022-12-15"
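
For comparison, the same lookup by parameter name can be sketched in base R without stringr by splitting the query string into key=value pairs. The helper get_query_param and the variable url_string are hypothetical names introduced for illustration, and the URL here is a shortened version of the one in the question. The sketch assumes a plain character URL with no "#" fragment and no percent-encoded values.

```r
# Hypothetical helper: return the value of one query parameter from a URL string,
# or NA if the parameter is absent.
get_query_param <- function(url, name) {
  query <- sub("^[^?]*\\?", "", url)                # drop everything up to the first "?"
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]  # "key=value" pieces
  keys  <- sub("=.*$", "", pairs)                   # text before the first "="
  vals  <- sub("^[^=]*=?", "", pairs)               # text after the first "="
  hit   <- match(name, keys)
  if (is.na(hit)) NA_character_ else vals[hit]
}

# Shortened version of the URL from the question (an assumption for brevity).
url_string <- "https://www.booking.com/hotel/mu/legends.en-gb.html?aid=356980&checkin=2022-12-08&checkout=2022-12-15&dest_type=city"

checkin  <- get_query_param(url_string, "checkin")
checkout <- get_query_param(url_string, "checkout")
```

Looking values up by name rather than by position keeps working if other date-like parameters are added to the URL or their order changes.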