I am trying to scrape the data in a table on this webpage (URL provided in the R code below).
My R code does extract the table, but I would like to trim the information in the column called "Your Choices".
I don't want all the data in that column. I want to keep only the text that comes before the first "\n".
Here is my R code:
library(tidyverse)
library(rvest)  # provides read_html() and html_table(); not attached by library(tidyverse)
content <- read_html("https://www.booking.com/hotel/mu/tamarin.en-gb.html?aid=356980&label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ&sid=729aafddc363c28a2c2c7379d7685d87&all_sr_blocks=36363601_246990918_2_85_0&checkin=2021-11-15&checkout=2021-11-20&dest_id=-1354779&dest_type=city&dist=0&from_beach_key_ufi_sr=1&group_adults=2&group_children=0&hapos=1&highlighted_blocks=36363601_246990918_2_85_0&hp_group_set=0&hpos=1&no_rooms=1&sb_price_type=total&sr_order=popularity&sr_pri_blocks=36363601_246990918_2_85_0__29200&srepoch=1619681695&srpvid=51c8354f03be0097&type=total&ucfs=1&req_children=0&req_adults=2&hp_refreshed_with_new_dates=1")
tables <- content %>% html_table(fill = TRUE)
View(tables)
second_table <- tables[[2]]
View(second_table)
In RStudio, the data from the column "Your Choices" are as follows (extract shown):
1 FREE cancellation\nbefore 23:59 on 12 November 2021\n\n\n\n\n\n\n\n\n\n10% base-rate discount available\n\n\n\n\n\n\nMeals:\nThere is no meal option with this room.\n\n\nCancellation:\n\nYou may cancel free of charge until 2 days before arrival. You will be charged the total price of the reservation if you cancel in the 2 days before arrival. If you don’t show up you will be charged the total price of the reservation.\n\n\nPrepayment:\nYou will be charged a prepayment of the total price at any time.
2 Good breakfast included\n\n\n\n\n\n\n\n\nFREE cancellation\nbefore 23:59 on 12 November 2021\n\n\n\n\n\n\n\n\n\n10% base-rate discount available\n\n\n\n\n\n\nMeals:\nContinental breakfast included\nBreakfast rated 7.6 - based on 38 reviews.\n\n\n\nCancellation:\n\nYou may cancel free of charge until 2 days before arrival. You will be charged the total price of the reservation if you cancel in the 2 days before arrival. If you don’t show up you will be charged the total price of the reservation.\n\n\nPrepayment:\nYou will be charged a prepayment of the total price at any time.
3 Breakfast & dinner included\n\n\n\n\n\n\n\n\n\nFREE cancellation\nbefore 23:59 on 12 November 2021\n\n\n\n\n\n\n\n\n\n10% base-rate discount available\n\n\n\n\n\n\nMeals:\nHalf board is included in the room rate.\nBreakfast rated 7.6 - based on 38 reviews.\n\n\n\nCancellation:\n\nYou may cancel free of charge until 2 days before arrival. You will be charged the total price of the reservation if you cancel in the 2 days before arrival. If you don’t show up you will be charged the total price of the reservation.\n\n\nPrepayment:\nYou will be charged a prepayment of the total price at any time.
To simplify, I would like the "Your Choices" column to look like this:
Your Choices
FREE cancellation
Good breakfast included
Breakfast & dinner included
How can I achieve this?
CodePudding user response:
You may use sub() to drop everything from the first "\n" onward:
library(tidyverse)
library(rvest)
# `content` is the page object created with read_html() in the question
tables <- content %>% html_table(fill = TRUE)
second_table <- tables[[2]]
second_table$`Your choices` <- sub('\n.*', '', second_table$`Your choices`)
second_table$`Your choices`
# [1] "FREE cancellation" "Good breakfast included"
# [3] "Breakfast & dinner included" "All-Inclusive"
# [5] "Breakfast & dinner included" "All-Inclusive"
# [7] "FREE cancellation" "Breakfast & dinner included"
# [9] "Good breakfast included" "All-Inclusive"
#[11] "Breakfast & dinner included" "All-Inclusive"
#[13] "FREE cancellation" "Good breakfast included"
#[15] "Breakfast & dinner included" "All-Inclusive"
#[17] "Breakfast & dinner included" "FREE cancellation"
#[19] "Good breakfast included" "All-Inclusive"
#[21] "Breakfast & dinner included" "All-Inclusive"
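The same idea can be tried without scraping anything. The sketch below is only an illustration: the `choices` vector is made-up sample text shaped like the column in the question, not data taken from the page. It relies on R's default regex engine, where "." also matches a newline. With `perl = TRUE` that is no longer the case, so the pattern would need `(?s)` in front. Splitting on the literal "\n" and keeping the first piece is shown as an alternative that does not depend on regex behaviour at all.

```r
# Hypothetical sample values imitating the scraped column (not real page data)
choices <- c(
  "FREE cancellation\nbefore 23:59 on 12 November 2021\n\n\nMeals:\nNo meal option.",
  "Good breakfast included\n\n\nFREE cancellation\nbefore 23:59 on 12 November 2021",
  "All-Inclusive"
)

# Approach from the answer: delete the first newline and everything after it.
# Strings without a newline are returned unchanged.
first_line_sub <- sub("\n.*", "", choices)

# Alternative: split on the literal newline and keep the first piece.
first_line_split <- vapply(
  strsplit(choices, "\n", fixed = TRUE),
  function(pieces) pieces[1],
  character(1)
)

print(first_line_sub)
print(identical(first_line_sub, first_line_split))
```

If the real column carries stray leading or trailing spaces, wrapping either result in trimws() should tidy them up.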