Scraping information from Booking.com

Time:09-12

I am trying to scrape some information from booking.com. I have already handled things like pagination and extracting the title.

I am trying to extract the number of guests from here.

[Screenshot from the original post, not reproduced here]

This is my code:

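import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))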
driver.maximize_window()
test_url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
driver.get(test_url)
time.sleep(3)
soup2 = BeautifulSoup(driver.page_source, 'lxml')
guests = soup2.select_one('span.xp__guests__count')
guests = guests.text if guests else None  # guard against a missing element
amenities = soup2.select_one('div.hprt-facilities-block')   

The result is: '\n2 adults\n·\n\n0 children\n\n·\n\n1 room\n\n'

I know I could extract the information with a regular expression, but I would like to understand whether there is a way to extract "2 adults" directly from the element shown in the picture above.

Thanks.

CodePudding user response:

This is one way to get that information, without using BeautifulSoup (why parse the page twice?):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
[...]
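# `browser` is assumed to be the WebDriver instance created in the elided setup; wait up to 20 seconds
wait = WebDriverWait(browser, 20)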
url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
browser.get(url)
guest_count = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span[class='xp__guests__count']"))).find_element(By.TAG_NAME, "span")
print(guest_count.text)

Result in terminal:

2 adults

Selenium docs can be found at https://www.selenium.dev/documentation/
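
If the text has already been pulled out with BeautifulSoup, as in the question, the first entry can also be isolated without a regular expression by splitting on the "·" separator. The following is only a minimal sketch, assuming the separator and layout shown in the question's output; split_guest_summary is a hypothetical helper name, not part of any library.

```python
def split_guest_summary(raw_text):
    """Split the raw summary text on the '·' separator and strip whitespace."""
    # Assumption: entries are separated by a middle dot, as in the question's output.
    parts = [part.strip() for part in raw_text.split("·")]
    # Drop empty entries that can appear if the text starts or ends with a separator.
    return [part for part in parts if part]


raw = '\n2 adults\n·\n\n0 children\n\n·\n\n1 room\n\n'
entries = split_guest_summary(raw)
print(entries[0])  # 2 adults
```

The other entries ("0 children", "1 room") are then available at indexes 1 and 2, provided the page keeps the same order.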

CodePudding user response:

I haven't used BeautifulSoup. I use Selenium. This is how I would do it in Selenium:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.maximize_window()

test_url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
driver.get(test_url)
time.sleep(3)

element = driver.find_element(By.XPATH,"//span[@class='xp__guests__count']")
adults = int(element.text.split(" adults")[0])
print(adults)

Basically, I find the span element that contains the text you are looking for. Its .text attribute returns all of the inner text (in this case, "2 adults · 0 children · 1 room").

The next line takes only the part of the string that comes before " adults", then converts it to an int.
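
Note that splitting on " adults" relies on the plural form, so a summary such as "1 adult" would raise a ValueError. As a hedged alternative, the sketch below uses a regular expression to read all three counts at once. The name parse_guest_counts and the accepted labels are assumptions based on the English text shown above, not on Booking.com's markup.

```python
import re

# Assumption: the summary uses English labels like "2 adults", "1 child", "1 room".
_COUNT_PATTERN = re.compile(r"(\d+)\s+(adult|child|room)", re.IGNORECASE)


def parse_guest_counts(summary):
    """Return a dict of counts keyed by the singular label found in the text."""
    counts = {}
    for number, label in _COUNT_PATTERN.findall(summary):
        counts[label.lower()] = int(number)
    return counts


print(parse_guest_counts("2 adults · 0 children · 1 room"))
# {'adult': 2, 'child': 0, 'room': 1}
```

Because the pattern matches only the start of each label, "adults" and "children" are stored under the keys "adult" and "child". With Selenium, the input would be element.text from the snippet above.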
