Scraping information from Booking.com

I am trying to scrape some information from booking.com. I have already handled things like pagination and extracting the title.

I am trying to extract the number of guests from here.

This is my code:

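import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()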
test_url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
driver.get(test_url)
time.sleep(3)
soup2 = BeautifulSoup(driver.page_source, 'lxml')
guests = soup2.select_one('span.xp__guests__count')
# Check the element that was just selected (the original tested an unrelated name, price)
guests = guests.text if guests else None
amenities = soup2.select_one('div.hprt-facilities-block')

The result is this: '\n2 adults\n·\n\n0 children\n\n·\n\n1 room\n\n'

I know that I could extract the information with a regular expression, but I would like to understand whether there is a way to extract just the "2 adults" part directly.

Thanks.

CodePudding user response:

This is one way to get that information, without using BeautifulSoup (why parse the page twice?):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
[...]
wait = WebDriverWait(browser, 20)
url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
browser.get(url)
guest_count = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span[class='xp__guests__count']"))).find_element(By.TAG_NAME, "span")
print(guest_count.text)

Result in terminal:

2 adults

Selenium docs can be found at https://www.selenium.dev/documentation/

CodePudding user response:

I haven't used BeautifulSoup. I use Selenium. This is how I would do it in Selenium:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.maximize_window()

test_url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
driver.get(test_url)
time.sleep(3)

element = driver.find_element(By.XPATH,"//span[@class='xp__guests__count']")
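# split() with no arguments splits on whitespace; the first token is the adult count.
# Unlike split(" adults"), this also works when the text reads "1 adult".
adults = int(element.text.split()[0])
print(adults)

Basically, I find the span element that contains the text you are looking for. .text gives you all of its inner text (in this case, "2 adults · 0 children · 1 room").

The next line takes only the first whitespace-separated token of that string, which is the number of adults, then converts it to an int.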
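
If you also need the children and room counts, one option is to parse the whole string in one pass with a regular expression. This is only a minimal sketch and not part of the answer above: the function name parse_occupancy and the dictionary keys are made up for illustration, and it assumes the text always uses the English labels seen in the question ("adults", "children", "room").

```python
import re

# A number followed by one of the known labels. Matching on the stems
# "adult", "child" and "room" covers both singular and plural forms
# ("1 adult", "2 adults", "0 children", "2 rooms").
PART = re.compile(r"(\d+)\s+(adult|child|room)", re.IGNORECASE)


def parse_occupancy(text):
    """Return the counts found in text such as '2 adults · 0 children · 1 room'."""
    counts = {}
    for number, label in PART.findall(text):
        counts[label.lower()] = int(number)
    return counts


# The raw string reported in the question, newlines included.
sample = '\n2 adults\n·\n\n0 children\n\n·\n\n1 room\n\n'
print(parse_occupancy(sample))  # {'adult': 2, 'child': 0, 'room': 1}
```

The same function works on the string returned by BeautifulSoup's .text and on the one returned by Selenium's .text, because the pattern ignores the whitespace and the "·" separators between the parts.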