Scrape Data from multiple urls from Airbnb with Python


I managed to scrape all the data from the landing page of Airbnb (price, name, ratings, etc.), and I also know how to use a loop with the pagination in order to scrape data from multiple pages.

What I would like to do is scrape the data for each specific listing, i.e. the data within the listing page itself (description, amenities, etc.).

What I was thinking is to implement the same logic as the pagination, since I already have a list with the links, but it's difficult for me to understand how to do it.
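What I have in mind is roughly the sketch below (a rough sketch only; links is the list of listing URLs built further down, and the selectors for the per-listing data are exactly the part I'm missing):

for url in links:
    driver.get(url)
    time.sleep(5)  # wait for the listing page to render
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # ... extract description, amenities, etc. from soup here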

Here is the code I use to scrape the links.

Imports

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time

airbnb_url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=unknown'

# start a single driver instance and open the search results page
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(airbnb_url)

driver.maximize_window()
time.sleep(5)

Links

links = []
soup = BeautifulSoup(driver.page_source, 'lxml')
for card in soup.select('div[]'):
    links.append('https://www.airbnb.com' + card.select_one('a[]')['href'])

What I used to extract the "where to sleep" section is this, but I am probably using the wrong tag.

amenities = []
for url in links:
    driver.get(url)
    soup1 = BeautifulSoup(driver.page_source, 'lxml')
    for amenity in soup1.select('div[]'):
        amenities.append(amenity.select_one('div[]'))

That was my first question; the other one is whether anybody knows how I can scrape the availability of each listing.

Thanks a lot!

CodePudding user response:

BeautifulSoup is more comfortable in some situations, but it is not always needed in your scraping process. Also, avoid using time.sleep() and switch to Selenium waits.
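For example, instead of a fixed time.sleep(), an explicit wait returns as soon as its condition is met. A minimal sketch, reusing the [itemprop="itemListElement"] selector from the example below:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# blocks for at most 10 seconds, but continues as soon as the result cards are in the DOM
cards = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[itemprop="itemListElement"]')))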

Since there should only be one issue per question, I focus on the first one. To scrape all of the amenities you have to open the modal via a button click, then grab the row titles from it:

[i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'[data-testid="modal-container"] [id$="-row-title"]')))] 

Note: To avoid errors caused by other elements swallowing the clicks, I initially handle the cookie banners.

Example

Just to point you in a direction, scraping is limited to urls[:5]; simply remove the slice to get all results.

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
airbnb_url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=unknown'

driver.get(airbnb_url)

driver.maximize_window()

# dismiss the cookie banners first so they cannot intercept later clicks
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR,'[data-testid="main-cookies-banner-container"] button'))).click()
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR,'[data-testid="save-btn"]'))).click()

# collect the unique listing URLs from the result cards
urls = list(set(a.get_attribute('href') for a in driver.find_elements(By.CSS_SELECTOR,'[itemprop="itemListElement"] a')))

data = []

for url in urls[:5]:
    driver.get(url)
    # open the amenities modal so that every amenity row is rendered
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR,'[data-section-id="AMENITIES_DEFAULT"] button'))).click()
    soup = BeautifulSoup(driver.page_source, 'lxml')

    data.append({
        'title':soup.h1.text,
        'amenities':[i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'[data-testid="modal-container"] [id$="-row-title"]')))]
    })

pd.DataFrame(data)

Output

    title                                                amenities
0   Ein innovatives, gemütliches, unkonventionelle...   [Föhn, Shampoo, Warmwasser, Duschgel, Waschmas...
1   Stadtzentrum, Meerblick, Große private Terrasse     [Blick auf die Skyline der Stadt, Seeblick, Bl...
2   Innovative, atemberaubende Wohnung im obersten...   [Föhn, Shampoo, Warmwasser, Duschgel, Waschmas...
3   Budget-Wohnung - zentralster Ort                    [TV, Klimaanlage, Heizung, WLAN, Arbeitsplatz,...
4   Diameno Studio Center Town                          [Föhn, Reinigungsprodukte, Shampoo, Warmwasser...
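Note that if a listing lacks the amenities button, the element_to_be_clickable wait raises a TimeoutException and stops the loop. A minimal sketch for skipping such listings instead, assuming that is the behaviour you want:

from selenium.common.exceptions import TimeoutException

for url in urls[:5]:
    driver.get(url)
    try:
        WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR,'[data-section-id="AMENITIES_DEFAULT"] button'))).click()
    except TimeoutException:
        continue  # no amenities modal on this listing, skip it
    # ... scrape title and amenities as above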

CodePudding user response:

You want to scrape all the listing pages along with each listing's details page. Each results page contains 20 items, i.e. the items_offset parameter is incremented by 20 per page. I've built the pagination into the starting URL by following that offset, then navigated to each details page (invoking the driver and soup a second time) and extracted all the necessary information from there.

There are 15 pages and a single page has 20 listing items, so there are 15 * 20 = 300 listings in total. I've scraped 6 pages, i.e. 120 items, via range(0, 120, 20); you can pull all 300 items by passing (0, 300, 20) to the range function instead. Test my code first, then scrape all the pages, as Selenium is a bit slow and it will take a while.
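For reference, that range call yields one offset per results page; each value is substituted into the items_offset={offset} placeholder in the URL below:

print(list(range(0, 120, 20)))  # [0, 20, 40, 60, 80, 100] -> pages 1 to 6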

Script:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=user_map_move&price_filter_input_type=0&ne_lat=40.66256734970964&ne_lng=23.003752862853986&sw_lat=40.59051931897441&sw_lng=22.892087137145978&zoom=13&search_by_map=true&federated_search_session_id=1ed21e1c-0c5e-4529-ab84-267361eac02b&pagination_search=true&items_offset={offset}&section_offset=2'

data = []
for offset in range(0,120,20):
    # page through the search results via the items_offset parameter
    driver.get(url.format(offset=offset))
    driver.maximize_window()
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'lxml')

    # collect the links to the individual listing pages on this results page
    detailed_pages = []
    for card in soup.select('div[]'):
        link = 'https://www.airbnb.com' + card.select_one('a[]').get('href')
        detailed_pages.append(link)

    # visit each details page and pull out the data points
    for page in detailed_pages:
        driver.get(page)
        driver.maximize_window()
        time.sleep(2)
        soup2 = BeautifulSoup(driver.page_source, 'lxml')
        price = soup2.select_one('span._tyxjp1')
        price = price.text if price else None
        rating = soup2.select_one('span._12si43g')
        rating = rating.text if rating else None
        Bedroom_area = soup2.select_one('div[]')
        Bedroom_area = Bedroom_area.text if Bedroom_area else None
        place_offers = ', '.join([x.get_text(strip=True) for x in soup2.select('[] div:nth-of-type(3) > div')])
        data.append({
            'place_offers': place_offers,
            'price': price,
            'rating': rating,
            'Bedroom_area': Bedroom_area
        })

df = pd.DataFrame(data)
print(df)

Output:

       place_offers                                     price   rating             Bedroom_area
0                                                        $23    None                     None
1                                                        $39  4.75 ·                     None
2                                                        $65   5.0 ·                     None
3    Kitchen, Wifi, TV, Washer, Air conditioning, P...   $90  4.92 ·                     None
4    Wifi, TV, Air conditioning, Hair dryer, Paid p...   $18  4.67 ·                     None
..                                                 ...   ...     ...                      ...
115  Kitchen, Wifi, Free street parking, Pets allow...   $43  4.83 ·  1 queen bed, 1 sofa bed
116  Kitchen, Wifi, HDTV with Netflix, Elevator, Ai...   $38  4.73 ·             1 double bed
117  Wifi, Dedicated workspace, TV, Elevator, AC - ...   $34  4.85 ·                     None
118  City skyline view, Kitchen, Wifi, Dedicated wo...   $47  4.81 ·                     None
119  Kitchen, Wifi, Free street parking, TV with Ne...   $38  4.88 ·                     None

[120 rows x 4 columns]