I managed to scrape all the data from the Airbnb landing page (price, name, ratings, etc.), and I also know how to use a loop to follow the pagination and scrape data from multiple pages.
What I would like to do now is scrape data for each specific listing, i.e. data that lives inside the listing page (description, amenities, etc.).
My idea was to apply the same logic as for the pagination, since I already have a list of links, but I find it hard to work out how to do it.
Here is the code to scrape the links:
Imports
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time
Getting the page
airbnb_url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=unknown'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(airbnb_url)
driver.maximize_window()
time.sleep(5)
Scraping links
links = []
soup = BeautifulSoup(driver.page_source, 'lxml')
# NOTE: the attribute values inside the [] selectors were stripped when this was
# posted; '[itemprop="itemListElement"]' (used in the answers below) matches the listing cards
for card in soup.select('div[itemprop="itemListElement"]'):
    links.append('https://www.airbnb.com' + card.select_one('a')['href'])
What I used to extract the "where to sleep" section is this, but I am probably using the wrong tag.
amenities = []
for url in links:
    driver.get(url)
    soup1 = BeautifulSoup(driver.page_source, 'lxml')
    # the attribute values inside the [] selectors were stripped when this was
    # posted, so the original (wrong) tags are unknown
    for amenity in soup1.select('div[]'):
        amenities.append(amenity.select_one('div[]'))
That was my first question; the other one is whether anybody knows how I can scrape the availability of each listing.
Thanks a lot!
CodePudding user response:
You want to scrape all the listing pages along with each listing's details page. Each results page contains 20 items, i.e. the offset is incremented by 20 listings per page. I've built the pagination into the starting URL via that offset, then visited each details page with the driver and a second soup, and from the details page you extract all the necessary information.
There are 15 pages and a single page holds 20 listings, so there are 15*20 = 300 listings in total. I've scraped 6 pages, i.e. 120 items, using range(0, 120, 20). You can pull all 300 items by putting (0, 300, 20) inside the range function. Test my code first, then scrape all the pages, as Selenium is a bit slow and will take a while.
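Just to make the offset arithmetic concrete, here is a minimal sketch (the URL is shortened to the one parameter that matters):
# each results page shows 20 listings, addressed via the items_offset query parameter
url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?items_offset={offset}'

for offset in range(0, 120, 20):      # 6 pages; use range(0, 300, 20) for all 15
    print(url.format(offset=offset))  # ...items_offset=0, 20, 40, 60, 80, 100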
Script:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=user_map_move&price_filter_input_type=0&ne_lat=40.66256734970964&ne_lng=23.003752862853986&sw_lat=40.59051931897441&sw_lng=22.892087137145978&zoom=13&search_by_map=true&federated_search_session_id=1ed21e1c-0c5e-4529-ab84-267361eac02b&pagination_search=true&items_offset={offset}&section_offset=2'
data = []
for offset in range(0, 120, 20):
    driver.get(url.format(offset=offset))
    driver.maximize_window()
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    detailed_pages = []
    # the attribute values inside the [] selectors were stripped when this was
    # posted; '[itemprop="itemListElement"]' (also used in the answer below) matches the cards
    for card in soup.select('div[itemprop="itemListElement"]'):
        link = 'https://www.airbnb.com' + card.select_one('a').get('href')
        detailed_pages.append(link)
    for page in detailed_pages:
        driver.get(page)
        driver.maximize_window()
        time.sleep(2)
        soup2 = BeautifulSoup(driver.page_source, 'lxml')
        price = soup2.select_one('span._tyxjp1')
        price = price.text if price else None
        rating = soup2.select_one('span._12si43g')
        rating = rating.text if rating else None
        # the original attribute selectors were stripped when this was posted; the
        # '[data-section-id=...]' selectors below are assumptions borrowed from the second answer
        Bedroom_area = soup2.select_one('div[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div div')
        Bedroom_area = Bedroom_area.text if Bedroom_area else None
        place_offers = ', '.join([x.get_text(strip=True) for x in soup2.select('[data-section-id="AMENITIES_DEFAULT"] div:nth-of-type(3) > div')])
        data.append({
            'place_offers': place_offers,
            'price': price,
            'rating': rating,
            'Bedroom_area': Bedroom_area
        })
df=pd.DataFrame(data)
print(df)
Output:
place_offers price rating Bedroom_area
0 $23 None None
1 $39 4.75 · None
2 $65 5.0 · None
3 Kitchen, Wifi, TV, Washer, Air conditioning, P... $90 4.92 · None
4 Wifi, TV, Air conditioning, Hair dryer, Paid p... $18 4.67 · None
.. ... ... ... ...
115 Kitchen, Wifi, Free street parking, Pets allow... $43 4.83 · 1 queen bed, 1 sofa bed
116 Kitchen, Wifi, HDTV with Netflix, Elevator, Ai... $38 4.73 · 1 double bed
117 Wifi, Dedicated workspace, TV, Elevator, AC - ... $34 4.85 · None
118 City skyline view, Kitchen, Wifi, Dedicated wo... $47 4.81 · None
119 Kitchen, Wifi, Free street parking, TV with Ne... $38 4.88 · None
[120 rows x 4 columns]
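If you want to keep the results, you can write the frame to disk afterwards (the filename is just an example):
df.to_csv('airbnb_thessaloniki.csv', index=False)  # persist the 120 scraped rows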
CodePudding user response:
BeautifulSoup is more comfortable in some situations, but not always needed in your scraping process. Also avoid selecting your elements by dynamic classes, and instead of time.sleep() switch to selenium waits.
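A minimal sketch of that switch (it reuses the listing-card selector from the example further down; driver is the Chrome instance created earlier):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# waits up to 5 seconds but returns as soon as the anchors are present,
# instead of always sleeping a fixed amount of time
cards = WebDriverWait(driver, 5).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[itemprop="itemListElement"] a'))
)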
Iterating over all listing pages, I would recommend using a while-loop to keep your script generic, checking in every iteration if there is a next page available, else break your loop. This eliminates the need to manually count pages and entries, as well as the use of a static range():
try:
    next_page = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="EXPLORE_NUMBERED_PAGINATION:TAB_ALL_HOMES"] button a'))).get_attribute('href')
except:
    next_page = None

#### process your data

if next_page:
    airbnb_url = next_page
else:
    break
To scrape all of the amenities you have to open the modal via button click:
[i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'[data-testid="modal-container"] [id$="-row-title"]')))]
Note: To avoid errors when other elements intercept your clicks, check whether you have to handle cookie banners first.
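A defensive pattern for that could look like the following; the selector is purely hypothetical, inspect the actual banner on your side:
# try to dismiss a cookie banner if one shows up; the selector is a placeholder
try:
    WebDriverWait(driver, 3).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'button[aria-label="Accept"]'))  # hypothetical
    ).click()
except Exception:
    pass  # no banner found within 3 seconds - carry on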
To extract the bedroom information, check for more static information like ids or the HTML structure, and also check if the element is available. These lines extract all the infos in this section and create a dict from heading and value:
if soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div div'):
    sleep_areas = list(soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div div').stripped_strings)
    d.update(dict(zip(sleep_areas[0::2], sleep_areas[1::2])))
else:
    d.update({'Bedroom': None})
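To illustrate the zip trick: stripped_strings yields alternating headings and values, so pairing the even and odd slices gives a dict (the sample values here just mirror the output further down):
sleep_areas = ['Bedroom', '1 double bed', 'Living room', '1 sofa bed']
print(dict(zip(sleep_areas[0::2], sleep_areas[1::2])))
# {'Bedroom': '1 double bed', 'Living room': '1 sofa bed'}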
Example
Just to point in a direction, and so that not everybody has to do a full scrape, I limited the scraping of objects in this example to urls[:1] per page; simply remove the [:1] to get all results.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--lang=en")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
airbnb_url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=unknown'
driver.maximize_window()
data = []
while True:
    driver.get(airbnb_url)
    urls = list(set(a.get_attribute('href') for a in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[itemprop="itemListElement"] a')))))
    try:
        next_page = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="EXPLORE_NUMBERED_PAGINATION:TAB_ALL_HOMES"] button a'))).get_attribute('href')
    except:
        next_page = None
    print('Scrape listings from page: ' + str(next_page))
    for url in urls[:1]:
        driver.get(url)
        WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="AMENITIES_DEFAULT"] button'))).click()
        soup = BeautifulSoup(driver.page_source, 'lxml')
        d = {
            'title': soup.h1.text,
            'amenities': [i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[data-testid="modal-container"] [id$="-row-title"]')))]
        }
        if soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div div'):
            sleep_areas = list(soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div div').stripped_strings)
            d.update(dict(zip(sleep_areas[0::2], sleep_areas[1::2])))
        else:
            d.update({'Bedroom': None})
        data.append(d)
    if next_page:
        airbnb_url = next_page
    else:
        break
pd.DataFrame(data)
Output
| | title | amenities | Common space | Bedroom | Living room |
|---|---|---|---|---|---|
| 0 | 8 NETFLIX BELGIUM HELLEXPO UNIVERSITY | ['', 'Shampoo', 'Essentials', 'Hangers', 'Iron', 'TV', 'Air conditioning', 'Heating', 'Smoke alarm', 'Carbon monoxide alarm', 'Wifi', 'Dedicated workspace', 'Cooking basics', 'Long term stays allowed', 'Unavailable: Security cameras on property', 'Unavailable: Kitchen', 'Unavailable: Washer', 'Unavailable: Private entrance'] | 1 sofa bed | nan | nan |
| 4 | ASOPOO STUDIO | ['Hair dryer', 'Shampoo', 'Hot water', 'Essentials', '', 'Bed linens', 'Iron', 'TV', 'Heating', 'Wifi', 'Kitchen', 'Refrigerator', 'Dishes and silverware', 'Free street parking', 'Elevator', 'Paid parking off premises', 'Long term stays allowed', 'Host greets you', 'Unavailable: Washer', 'Unavailable: Air conditioning', 'Unavailable: Smoke alarm', 'Unavailable: Carbon monoxide alarm', 'Unavailable: Private entrance'] | 1 sofa bed | 1 double bed | nan |
| 14 | Aristotelous 8th floor 1bd apt with wonderful view | ['Hot water', 'Shower gel', 'Free washer – In unit', 'Essentials', 'Hangers', 'Bed linens', 'Iron', 'Drying rack for clothing', 'Clothing storage', 'TV', 'Pack ’n play/Travel crib - available upon request', 'Air conditioning', 'Heating', 'Wifi', 'Dedicated workspace', 'Kitchen', 'Refrigerator', 'Microwave', 'Cooking basics', 'Dishes and silverware', 'Stove', 'Hot water kettle', 'Coffee maker', 'Baking sheet', 'Coffee', 'Dining table', 'Private patio or balcony', 'Outdoor furniture', 'Paid parking off premises', 'Pets allowed', 'Luggage dropoff allowed', 'Long term stays allowed', 'Self check-in', 'Lockbox', 'Unavailable: Security cameras on property', 'Unavailable: Smoke alarm', 'Unavailable: Carbon monoxide alarm', 'Unavailable: Private entrance'] | nan | 2 double beds | 1 sofa bed |
Since there is no expected output specified, just some additional thoughts:
If you would like to have your amenities not as a list but as a string, simply ','.join() them:
'amenities':','.join([i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'[data-testid="modal-container"] [id$="-row-title"]')))])
If you would like to have a matrix with true / false you could manipulate your DataFrame:
df = pd.DataFrame(data)
df = df.explode('amenities')
pd.crosstab(df['title'],df['amenities']).ne(0).rename_axis(index='title',columns=None).reset_index()
Output:
title 32" HDTV 32" HDTV with standard cable 32" TV AC - split type ductless system Air conditioning Babysitter recommendations Backyard Baking sheet ... Unavailable: Kitchen Unavailable: Private entrance Unavailable: Security cameras on property Unavailable: Shampoo Unavailable: Smoke alarm Unavailable: TV Unavailable: Washer Washer Wifi Wine glasses
0 #SKGH Amaryllis luxury suite -NearHELEXPO False False False False False True False False False ... False True False False True False False True True False
1 8 NETFLIX BELGIUM HELLEXPO UNIVERSITY True False False False False True False False False ... True True True False False False True False True False
2 ASOPOO STUDIO True False False False False False False False False ... False True False False True False True False True False
...
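To see what the explode() / crosstab() combination does mechanically, here is a self-contained toy version (titles and amenities are made up):
import pandas as pd

# two listings with overlapping amenity lists (illustrative data)
df = pd.DataFrame({
    'title': ['Listing A', 'Listing B'],
    'amenities': [['Wifi', 'Kitchen'], ['Wifi', 'TV']],
})

df = df.explode('amenities')                               # one row per (title, amenity) pair
matrix = pd.crosstab(df['title'], df['amenities']).ne(0)   # occurrence counts -> booleans
print(matrix)
# amenities  Kitchen     TV  Wifi
# title
# Listing A     True  False  True
# Listing B    False   True  True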