BeautifulSoup not returning full HTML from Airbnb search page

I am trying to use BeautifulSoup and Selenium to scrape data from Airbnb. I want to gather each listing from this search page.

This is what I have so far:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def scrape_page(page_url):
    
    driver_path = "C:/Users/parkj/Downloads/chromedriver_win32/chromedriver.exe"
    driver = webdriver.Chrome(service = Service(driver_path))
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'itemprop')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    
    return soup

def extract_listing(page_url):
    
    page_soup = scrape_page(page_url)
    listings = page_soup.find_element(By.CLASS_NAME, "itemprop")
    return listings

page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto Prefecture, Japan&date_picker_type=flexible_dates&search_type=unknown"
#items = extract_listing(page_url)

#process items to get all information you need, just an example
#[{'name':items.select_one('[itemprop="name"]')['content'],
#  'url':items.select_one('[itemprop="url"]')['content']} 
# for i in items]

test = scrape_page(page_url)
test

It seems like scrape_page() returns HTML from the search page, but not the full content. It does not include the information I need, which is this part of the HTML:

[Image of HTML script]

I did some research and saw that WebDriverWait might help, but when I use it I get a TimeoutException:

[Image of TimeoutException error]

The end goal is to get each listing's name and URL. The first 3 items in the resulting list should look similar to this:

[{'name': '✿Kyoto✿/Near Station & Bus/Temple/Twin Room(^^♪✿✿',
  'url': 'www.airbnb.com/rooms/50290730?adults=1&children=0&infants=0&check_in=2022-07-20&check_out=2022-07-27&previous_page_section_name=1000'},
 {'name': 'Stay in Kyoto central island',
  'url': 'www.airbnb.com/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'},
 {'name': '和楽庵【Single】100 Year old Machiya Guest House (1pax)',
  'url': 'www.airbnb.com/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'}]

I apologize in advance if I did not include enough information in this question, as this is my first time posting here. I would appreciate any help. Thank you.

CodePudding user response：

I don't use Selenium very often, but I recommend requests.

Try this:

from requests import get
from bs4 import BeautifulSoup

headers = {'User-agent':'Mozilla/5.0 (X11; Linux i686; rv:100.0) Gecko/20100101 Firefox/100.0.'}

res = get('https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto Prefecture, Japan&date_picker_type=flexible_dates&search_type=unknown', headers=headers)

soup = BeautifulSoup(res.text, features="html.parser")

url_list = soup.find_all("meta", attrs={"itemprop":"url"})

In my case, it returned 20 results, which is the number of listings on one page. If you want more, you need to scrape another page.

The Firefox user agent is very important. It is an old scraping trick: a lot of pages don't block this user agent.
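To get from those meta tags to plain names and URLs, something like the following might work. It is only a sketch: it runs on a small hand-written HTML sample instead of the live page, and the sample markup, the listing names and the helper name meta_contents are my own assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the search page; the real markup may differ.
SAMPLE_HTML = """
<div itemprop="itemListElement">
  <meta itemprop="name" content="Example listing A">
  <meta itemprop="url" content="www.airbnb.com/rooms/1">
</div>
<div itemprop="itemListElement">
  <meta itemprop="name" content="Example listing B">
  <meta itemprop="url" content="www.airbnb.com/rooms/2">
</div>
"""


def meta_contents(html, itemprop):
    """Return the content attribute of every <meta itemprop=...> tag."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("meta", attrs={"itemprop": itemprop})
    # Skip tags that have no content attribute instead of raising KeyError.
    return [tag["content"] for tag in tags if tag.has_attr("content")]


names = meta_contents(SAMPLE_HTML, "name")
urls = meta_contents(SAMPLE_HTML, "url")
print(list(zip(names, urls)))
```

With the real page you would pass res.text instead of SAMPLE_HTML. Pairing names and URLs by position assumes both tags appear once per listing and in the same order.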

CodePudding user response：

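Select the elements you are waiting for more specifically, in this case with a CSS selector. itemprop is an attribute name, not an element ID, so waiting on By.ID with 'itemprop' finds nothing and times out: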

wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[itemprop="itemListElement"]')))

Also avoid mixing Selenium syntax with BeautifulSoup, and use CSS selectors in bs4 syntax instead:

listings = page_soup.select('[itemprop="itemListElement"]')
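To illustrate why the Selenium call does not work on a soup, here is a small offline sketch. The one-listing HTML string is made up, and the plain string "class name" stands in for By.CLASS_NAME so the snippet does not need Selenium. As far as I know, bs4 treats an unknown attribute such as find_element as a search for a tag with that name, which returns None here, so calling it fails.

```python
from bs4 import BeautifulSoup

# Made-up markup with a single listing element, for illustration only.
html = (
    '<div itemprop="itemListElement">'
    '<meta itemprop="name" content="Example listing">'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# soup.find_element is looked up as a <find_element> tag, which does not
# exist, so the attribute is None and calling it raises TypeError.
try:
    soup.find_element("class name", "itemprop")
    error = None
except TypeError as exc:
    error = exc

# The BeautifulSoup way: a CSS attribute selector returns a list of tags.
listings = soup.select('[itemprop="itemListElement"]')
print(type(error).__name__, len(listings))
```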

Example

...
def scrape_page(page_url):
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[itemprop="itemListElement"]')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    
    return soup

def extract_listing(page_url):
    
    page_soup = scrape_page(page_url)
    listings = page_soup.select('[itemprop="itemListElement"]')
    return listings

page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto Prefecture, Japan&date_picker_type=flexible_dates&search_type=unknown"
items = extract_listing(page_url)

#process items to get all information you need, just an example
[{'name':i.select_one('[itemprop="name"]')['content'],
 'url':i.select_one('[itemprop="url"]')['content']} 
for i in items]

Output

[{'name': '✿Kyoto✿/Nähe Bahnhof & Bus/Tempel/Einzelzimmer(^^♪',
  'url': 'www.airbnb.de/rooms/50293998?adults=1&children=0&infants=0&check_in=2022-06-22&check_out=2022-06-29&previous_page_section_name=1000'},
 {'name': '100 Jahre altes Machiya-Gästehaus (1Pax)',
  'url': 'www.airbnb.de/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-08-22&check_out=2022-08-29&previous_page_section_name=1000'},
 {'name': '27, Deluxe Designer Zweibett- / Dreibettzimmer in Shijo (1-3 Personen  / Nichtraucher)',
  'url': 'www.airbnb.de/rooms/41413491?adults=1&children=0&infants=0&check_in=2023-05-16&check_out=2023-05-23&previous_page_section_name=1000'},
 {'name': 'Aufenthalt auf der zentralen Insel Kyoto',
  'url': 'www.airbnb.de/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-24&check_out=2022-07-01&previous_page_section_name=1000'},
 {'name': 'Sweet 202 Privatzimmer ☘️',
  'url': 'www.airbnb.de/rooms/30217767?adults=1&children=0&infants=0&check_in=2022-07-18&check_out=2022-07-25&previous_page_section_name=1000'},
 {'name': 'Kyoto Sanjo Ohashi Superior Zweibettzimmer Studio Nichtraucher Superior Zweibettzimmer',
  'url': 'www.airbnb.de/rooms/45207535?adults=1&children=0&infants=0&check_in=2022-09-27&check_out=2022-10-04&previous_page_section_name=1000'},
 {'name': 'Toller Blick auf den Fluss, schönes traditionelles Haus',
  'url': 'www.airbnb.de/rooms/25762078?adults=1&children=0&infants=0&check_in=2022-12-07&check_out=2022-12-14&previous_page_section_name=1000'},
 {'name': 'Doppelzimmer - Waschmaschine in allen Zimmern ☆ Guest House 10-Minuten zu Fuß von Kyoto Station -',
  'url': 'www.airbnb.de/rooms/51433076?adults=1&children=0&infants=0&check_in=2022-06-13&check_out=2022-06-20&previous_page_section_name=1000'},
 {'name': 'In der Nähe des Bahnhofs Kyoto Gemütliches Zimmer in einem traditionellen Haus',
  'url': 'www.airbnb.de/rooms/25600163?adults=1&children=0&infants=0&check_in=2022-09-12&check_out=2022-09-19&previous_page_section_name=1000'},
 {'name': 'Gemütliche und ruhige zweistöckige japanische Wohnung',
  'url': 'www.airbnb.de/rooms/38743436?adults=1&children=0&infants=0&check_in=2023-03-11&check_out=2023-03-18&previous_page_section_name=1000'},
 {'name': '51★Günstigste★5 Minuten zu Fuß Shin-Osaka Sta.★Max 1 Gäste',
  'url': 'www.airbnb.de/rooms/14539052?adults=1&children=0&infants=0&check_in=2022-07-03&check_out=2022-07-10&previous_page_section_name=1000'},
 {'name': '和楽庵【Doppel】100 Jahre altes Machiya Gästehaus (2pax)',
  'url': 'www.airbnb.de/rooms/22867502?adults=1&children=0&infants=0&check_in=2022-08-26&check_out=2022-09-02&previous_page_section_name=1000'},
 {'name': 'Expo Hostel Nishi #1 /1000yen Fahrrad für deinen Aufenthalt',
  'url': 'www.airbnb.de/rooms/8295322?adults=1&children=0&infants=0&check_in=2022-08-27&check_out=2022-09-03&previous_page_section_name=1000'},
 {'name': '★Lovely RiverSide House in★der Nähe von Einkaufsviertel★3 Betten',
  'url': 'www.airbnb.de/rooms/40117962?adults=1&children=0&infants=0&check_in=2022-07-07&check_out=2022-07-14&previous_page_section_name=1000'},
 {'name': 'ZIMMER - Bereich Central Kyoto Gion',
  'url': 'www.airbnb.de/rooms/15215980?adults=1&children=0&infants=0&check_in=2022-06-14&check_out=2022-06-21&previous_page_section_name=1000'},
 {'name': 'Raum, um das Kyoto zu genießen.',
  'url': 'www.airbnb.de/rooms/9263813?adults=1&children=0&infants=0&check_in=2022-09-08&check_out=2022-09-15&previous_page_section_name=1000'},
 {'name': 'Stilvolles modernes Kyo-Machiya 500 金閣寺 m vom Trockner entfernt',
  'url': 'www.airbnb.de/rooms/20041502?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'},
 {'name': 'Hotel Sou Kyoto Gion Queen Studio',
  'url': 'www.airbnb.de/rooms/40236377?adults=1&children=0&infants=0&check_in=2022-06-22&check_out=2022-06-29&previous_page_section_name=1000'},
 {'name': 'Workation GroLiving in  KYOTO',
  'url': 'www.airbnb.de/rooms/612511811801466646?adults=1&children=0&infants=0&check_in=2022-07-19&check_out=2022-07-26&previous_page_section_name=1000'},
 {'name': '【home quarantin ok】shibainuatiniya/Kyoto Sta/Toji',
  'url': 'www.airbnb.de/rooms/34028813?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'}]
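The list comprehension above assumes every listing element contains both a name and a url tag. select_one returns None when nothing matches, and indexing None raises a TypeError, so a more defensive variant could look like this. It is a sketch on made-up sample HTML, and listings_to_dicts is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up sample; the second listing deliberately lacks a url tag.
SAMPLE_HTML = """
<div itemprop="itemListElement">
  <meta itemprop="name" content="Example listing A">
  <meta itemprop="url" content="www.airbnb.com/rooms/1">
</div>
<div itemprop="itemListElement">
  <meta itemprop="name" content="Example listing B">
</div>
"""


def listings_to_dicts(html):
    """Collect name/url pairs, skipping listings that miss either tag."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select('[itemprop="itemListElement"]'):
        name_tag = item.select_one('[itemprop="name"]')
        url_tag = item.select_one('[itemprop="url"]')
        if name_tag is None or url_tag is None:
            continue  # incomplete listing, nothing to index into
        results.append(
            {"name": name_tag.get("content"), "url": url_tag.get("content")}
        )
    return results


result = listings_to_dicts(SAMPLE_HTML)
print(result)
```

Whether you skip incomplete listings or keep them with a None value depends on what you need downstream; skipping is just the simpler choice for this sketch.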

CodePudding user response：

I think you should look for something that can run the JavaScript on the site if you want to get the full content of the page, something like a stripped-down version of Chrome's engine.

I don't know if it can do the job, but something like Qt WebEngine might work.