How to extract this specific value from the cell of a table from this webpage?

Time:10-25

I am trying to scrape the data located in a table on this webpage (URL provided in R codes below).

My R code does extract the table, but I would like to trim the information in the column called "Your Choices".

I don't want all the data in the "Your Choices" column. I want to keep only the text that comes before the first "\n" in each cell.

Here is my R code:

library(tidyverse)
library(rvest)  # provides read_html() and html_table(); not attached by library(tidyverse)

content <- read_html("https://www.booking.com/hotel/mu/tamarin.en-gb.html?aid=356980&label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ&sid=729aafddc363c28a2c2c7379d7685d87&all_sr_blocks=36363601_246990918_2_85_0&checkin=2021-11-15&checkout=2021-11-20&dest_id=-1354779&dest_type=city&dist=0&from_beach_key_ufi_sr=1&group_adults=2&group_children=0&hapos=1&highlighted_blocks=36363601_246990918_2_85_0&hp_group_set=0&hpos=1&no_rooms=1&sb_price_type=total&sr_order=popularity&sr_pri_blocks=36363601_246990918_2_85_0__29200&srepoch=1619681695&srpvid=51c8354f03be0097&type=total&ucfs=1&req_children=0&req_adults=2&hp_refreshed_with_new_dates=1")

tables <- content %>% html_table(fill = TRUE)

View(tables)


second_table <- tables[[2]]

View(second_table)

In RStudio, the data from the column "Your Choices" are as follows (extract shown):

1 FREE cancellation\nbefore 23:59 on 12 November 2021\n\n\n\n\n\n\n\n\n\n10% base-rate discount available\n\n\n\n\n\n\nMeals:\nThere is no meal option with this room.\n\n\nCancellation:\n\nYou may cancel free of charge until 2 days before arrival. You will be charged the total price of the reservation if you cancel in the 2 days before arrival. If you don’t show up you will be charged the total price of the reservation.\n\n\nPrepayment:\nYou will be charged a prepayment of the total price at any time.

2 Good breakfast included\n\n\n\n\n\n\n\n\nFREE cancellation\nbefore 23:59 on 12 November 2021\n\n\n\n\n\n\n\n\n\n10% base-rate discount available\n\n\n\n\n\n\nMeals:\nContinental breakfast included\nBreakfast rated 7.6 - based on 38 reviews.\n\n\n\nCancellation:\n\nYou may cancel free of charge until 2 days before arrival. You will be charged the total price of the reservation if you cancel in the 2 days before arrival. If you don’t show up you will be charged the total price of the reservation.\n\n\nPrepayment:\nYou will be charged a prepayment of the total price at any time.

3 Breakfast & dinner included\n\n\n\n\n\n\n\n\n\nFREE cancellation\nbefore 23:59 on 12 November 2021\n\n\n\n\n\n\n\n\n\n10% base-rate discount available\n\n\n\n\n\n\nMeals:\nHalf board is included in the room rate.\nBreakfast rated 7.6 - based on 38 reviews.\n\n\n\nCancellation:\n\nYou may cancel free of charge until 2 days before arrival. You will be charged the total price of the reservation if you cancel in the 2 days before arrival. If you don’t show up you will be charged the total price of the reservation.\n\n\nPrepayment:\nYou will be charged a prepayment of the total price at any time.

To simplify, I would like the "Your Choices" column to look like this:

  Your Choices

FREE cancellation
Good breakfast included
Breakfast & dinner included

How can I achieve this?

Answer:

You can use sub() to drop everything from the first "\n" onward:

library(tidyverse)
library(rvest)

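# `content` is the page parsed with read_html() in the question
tables <- content %>% html_table(fill = TRUE)
second_table <- tables[[2]]
# Delete the first "\n" and everything after it, keeping only the first line of each cell
second_table$`Your choices` <- sub('\n.*', '', second_table$`Your choices`)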

second_table$`Your choices`

# [1] "FREE cancellation"           "Good breakfast included"    
# [3] "Breakfast & dinner included" "All-Inclusive"              
# [5] "Breakfast & dinner included" "All-Inclusive"              
# [7] "FREE cancellation"           "Breakfast & dinner included"
# [9] "Good breakfast included"     "All-Inclusive"              
#[11] "Breakfast & dinner included" "All-Inclusive"              
#[13] "FREE cancellation"           "Good breakfast included"    
#[15] "Breakfast & dinner included" "All-Inclusive"              
#[17] "Breakfast & dinner included" "FREE cancellation"          
#[19] "Good breakfast included"     "All-Inclusive"              
#[21] "Breakfast & dinner included" "All-Inclusive"            
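
The sub() call above can be tried without fetching the page. The following is a minimal sketch, assuming a small hand-written vector (`choices`, a hypothetical name) modelled on the extract in the question rather than the live table. It also shows two variants that should give the same result: splitting on "\n", and a perl = TRUE pattern, where "." stops at newlines unless the (?s) flag is set.

```r
# Hypothetical sample cells modelled on the "Your choices" extract in the question
choices <- c(
  "FREE cancellation\nbefore 23:59 on 12 November 2021\n\n10% base-rate discount available",
  "Good breakfast included\n\n\nFREE cancellation\nbefore 23:59 on 12 November 2021",
  "Breakfast & dinner included\n\n\nFREE cancellation"
)

# Default regex engine: "." also matches newlines, so ".*" runs to the end of the string
first_line <- sub("\n.*", "", choices)

# Variant that does not depend on regex behaviour: split on "\n" and keep the first piece
first_line_split <- vapply(strsplit(choices, "\n", fixed = TRUE), `[`, character(1), 1)

# With perl = TRUE, "." does not cross newlines by default; (?s) switches that on
first_line_perl <- sub("(?s)\n.*", "", choices, perl = TRUE)

print(first_line)
```

If a cell could begin with whitespace or a newline, applying trimws() to the column first would avoid getting an empty string back.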