I managed to scrape all the data from the Airbnb landing page (price, name, ratings, etc.), and I also know how to use a loop to follow the pagination and scrape data from multiple pages.
What I would like to do now is scrape data for each specific listing, i.e. data that lives inside the listing page (description, amenities, etc.).
My idea was to apply the same logic as for the pagination, since I already have a list of links, but I find it hard to work out how to do it.
Here is the code to scrape the links:
Imports
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time
Getting the page
airbnb_url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=unknown'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(airbnb_url)
driver.maximize_window()
time.sleep(5)
Scraping links
links = []
soup = BeautifulSoup(driver.page_source, 'lxml')
# NOTE: the attribute values inside the [] selectors were stripped when this was
# posted; '[itemprop="itemListElement"]' (used in the answers below) matches the listing cards
for card in soup.select('div[itemprop="itemListElement"]'):
    links.append('https://www.airbnb.com' + card.select_one('a')['href'])
What I used to extract the "where to sleep" section is this, but I am probably using the wrong tag.
amenities = []
for url in links:
    driver.get(url)
    soup1 = BeautifulSoup(driver.page_source, 'lxml')
    # the attribute values inside the [] selectors were stripped when this was
    # posted, so the original (wrong) tags are unknown
    for amenity in soup1.select('div[]'):
        amenities.append(amenity.select_one('div[]'))
That was my first question; the other one is whether anybody knows how I can scrape the availability of each listing.
Thanks a lot!
CodePudding user response:
You want to scrape all the listing pages along with each listing's details page. Each results page contains 20 items, i.e. the offset is incremented by 20 listings per page. I've built the pagination into the starting URL via that offset, then visited each details page with the driver and a second soup, and from the details page you extract all the necessary information.
There are 15 pages and a single page holds 20 listings, so there are 15*20 = 300 listings in total. I've scraped 6 pages, i.e. 120 items, using range(0, 120, 20). You can pull all 300 items by putting (0, 300, 20) inside the range function. Test my code first, then scrape all the pages, as Selenium is a bit slow and will take a while.
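Just to make the offset arithmetic concrete, here is a minimal sketch (the URL is shortened to the one parameter that matters):
# each results page shows 20 listings, addressed via the items_offset query parameter
url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?items_offset={offset}'

for offset in range(0, 120, 20):      # 6 pages; use range(0, 300, 20) for all 15
    print(url.format(offset=offset))  # ...items_offset=0, 20, 40, 60, 80, 100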
Script:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=user_map_move&price_filter_input_type=0&ne_lat=40.66256734970964&ne_lng=23.003752862853986&sw_lat=40.59051931897441&sw_lng=22.892087137145978&zoom=13&search_by_map=true&federated_search_session_id=1ed21e1c-0c5e-4529-ab84-267361eac02b&pagination_search=true&items_offset={offset}&section_offset=2'
data = []
for offset in range(0, 120, 20):
    driver.get(url.format(offset=offset))
    driver.maximize_window()
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    detailed_pages = []
    # the attribute values inside the [] selectors were stripped when this was
    # posted; '[itemprop="itemListElement"]' (also used in the answer below) matches the cards
    for card in soup.select('div[itemprop="itemListElement"]'):
        link = 'https://www.airbnb.com' + card.select_one('a').get('href')
        detailed_pages.append(link)
    for page in detailed_pages:
        driver.get(page)
        driver.maximize_window()
        time.sleep(2)
        soup2 = BeautifulSoup(driver.page_source, 'lxml')
        price = soup2.select_one('span._tyxjp1')
        price = price.text if price else None
        rating = soup2.select_one('span._12si43g')
        rating = rating.text if rating else None
        # the original attribute selectors were stripped when this was posted; the
        # '[data-section-id=...]' selectors below are assumptions borrowed from the second answer
        Bedroom_area = soup2.select_one('div[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div div')
        Bedroom_area = Bedroom_area.text if Bedroom_area else None
        place_offers = ', '.join([x.get_text(strip=True) for x in soup2.select('[data-section-id="AMENITIES_DEFAULT"] div:nth-of-type(3) > div')])
        data.append({
            'place_offers': place_offers,
            'price': price,
            'rating': rating,
            'Bedroom_area': Bedroom_area
        })
df=pd.DataFrame(data)
print(df)
Output:
place_offers price rating Bedroom_area
0 $23 None None
1 $39 4.75 · None
2 $65 5.0 · None
3 Kitchen, Wifi, TV, Washer, Air conditioning, P... $90 4.92 · None
4 Wifi, TV, Air conditioning, Hair dryer, Paid p... $18 4.67 · None
.. ... ... ... ...
115 Kitchen, Wifi, Free street parking, Pets allow... $43 4.83 · 1 queen bed, 1 sofa bed
116 Kitchen, Wifi, HDTV with Netflix, Elevator, Ai... $38 4.73 · 1 double bed
117 Wifi, Dedicated workspace, TV, Elevator, AC - ... $34 4.85 · None
118 City skyline view, Kitchen, Wifi, Dedicated wo... $47 4.81 · None
119 Kitchen, Wifi, Free street parking, TV with Ne... $38 4.88 · None
[120 rows x 4 columns]
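If you want to keep the results, you can write the frame to disk afterwards (the filename is just an example):
df.to_csv('airbnb_thessaloniki.csv', index=False)  # persist the 120 scraped rows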
CodePudding user response:
BeautifulSoup is more comfortable in some situations, but not always needed in your scraping process. Also avoid selecting your elements by dynamic classes, and instead of time.sleep() switch to selenium waits.
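A minimal sketch of that switch (it reuses the listing-card selector from the example further down; driver is the Chrome instance created earlier):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# waits up to 5 seconds but returns as soon as the anchors are present,
# instead of always sleeping a fixed amount of time
cards = WebDriverWait(driver, 5).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[itemprop="itemListElement"] a'))
)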
Iterating over all listing pages, I would recommend using a while-loop to keep your script generic, checking in every iteration if there is a next page available, else break your loop. This eliminates the need to manually count pages and entries, as well as the use of a static range():
try:
    next_page = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="EXPLORE_NUMBERED_PAGINATION:TAB_ALL_HOMES"] button a'))).get_attribute('href')
except:
    next_page = None

#### process your data

if next_page:
    airbnb_url = next_page
else:
    break
To scrape all of the amenities you have to open the modal via button click:
[i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'[data-testid="modal-container"] [id$="-row-title"]')))]
Note: To avoid errors when other elements intercept your clicks, check whether you have to handle cookie banners first.
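A defensive pattern for that could look like the following; the selector is purely hypothetical, inspect the actual banner on your side:
# try to dismiss a cookie banner if one shows up; the selector is a placeholder
try:
    WebDriverWait(driver, 3).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'button[aria-label="Accept"]'))  # hypothetical
    ).click()
except Exception:
    pass  # no banner found within 3 seconds - carry on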
To extract the bedroom information, check for more static information like ids or the HTML structure, and also check if the element is available. These lines extract all the infos in this section and create a dict from heading and value:
if soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div div'):
    sleep_areas = list(soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div div').stripped_strings)
    d.update(dict(zip(sleep_areas[0::2], sleep_areas[1::2])))
else:
    d.update({'Bedroom': None})
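To illustrate the zip trick: stripped_strings yields alternating headings and values, so pairing the even and odd slices gives a dict (the sample values here just mirror the output further down):
sleep_areas = ['Bedroom', '1 double bed', 'Living room', '1 sofa bed']
print(dict(zip(sleep_areas[0::2], sleep_areas[1::2])))
# {'Bedroom': '1 double bed', 'Living room': '1 sofa bed'}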
Example
Just to point in a direction, and so that not everybody has to do a full scrape, I limited the scraping of objects in this example to urls[:1] per page; simply remove the [:1] to get all results.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--lang=en")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
airbnb_url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=unknown'
driver.maximize_window()
data = []
while True:
    driver.get(airbnb_url)
    urls = list(set(a.get_attribute('href') for a in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[itemprop="itemListElement"] a')))))
    try:
        next_page = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="EXPLORE_NUMBERED_PAGINATION:TAB_ALL_HOMES"] button a'))).get_attribute('href')
    except:
        next_page = None
    print('Scrape listings from page: ' + str(next_page))
    for url in urls[:1]:
        driver.get(url)
        WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="AMENITIES_DEFAULT"] button'))).click()
        soup = BeautifulSoup(driver.page_source, 'lxml')
        d = {
            'title': soup.h1.text,
            'amenities': [i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[data-testid="modal-container"] [id$="-row-title"]')))]
        }
        if soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div div'):
            sleep_areas = list(soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div div').stripped_strings)
            d.update(dict(zip(sleep_areas[0::2], sleep_areas[1::2])))
        else:
            d.update({'Bedroom': None})
        data.append(d)
    if next_page:
        airbnb_url = next_page
    else:
        break
pd.DataFrame(data)
Output
| | title | amenities | Common space | Bedroom | Living room |
|---|---|---|---|---|---|
| 0 | 8 NETFLIX BELGIUM HELLEXPO UNIVERSITY | ['', 'Shampoo', 'Essentials', 'Hangers', 'Iron', 'TV', 'Air conditioning', 'Heating', 'Smoke alarm', 'Carbon monoxide alarm', 'Wifi', 'Dedicated workspace', 'Cooking basics', 'Long term stays allowed', 'Unavailable: Security cameras on property', 'Unavailable: Kitchen', 'Unavailable: Washer', 'Unavailable: Private entrance'] | 1 sofa bed | nan | nan |
| 4 | ASOPOO STUDIO | ['Hair dryer', 'Shampoo', 'Hot water', 'Essentials', '', 'Bed linens', 'Iron', 'TV', 'Heating', 'Wifi', 'Kitchen', 'Refrigerator', 'Dishes and silverware', 'Free street parking', 'Elevator', 'Paid parking off premises', 'Long term stays allowed', 'Host greets you', 'Unavailable: Washer', 'Unavailable: Air conditioning', 'Unavailable: Smoke alarm', 'Unavailable: Carbon monoxide alarm', 'Unavailable: Private entrance'] | 1 sofa bed | 1 double bed | nan |
| 14 | Aristotelous 8th floor 1bd apt with wonderful view | ['Hot water', 'Shower gel', 'Free washer – In unit', 'Essentials', 'Hangers', 'Bed linens', 'Iron', 'Drying rack for clothing', 'Clothing storage', 'TV', 'Pack ’n play/Travel crib - available upon request', 'Air conditioning', 'Heating', 'Wifi', 'Dedicated workspace', 'Kitchen', 'Refrigerator', 'Microwave', 'Cooking basics', 'Dishes and silverware', 'Stove', 'Hot water kettle', 'Coffee maker', 'Baking sheet', 'Coffee', 'Dining table', 'Private patio or balcony', 'Outdoor furniture', 'Paid parking off premises', 'Pets allowed', 'Luggage dropoff allowed', 'Long term stays allowed', 'Self check-in', 'Lockbox', 'Unavailable: Security cameras on property', 'Unavailable: Smoke alarm', 'Unavailable: Carbon monoxide alarm', 'Unavailable: Private entrance'] | nan | 2 double beds | 1 sofa bed |
Since there is no expected output specified, just some additional thoughts:
If you would like to have your amenities not as a list but as a string, simply ','.join() them:
'amenities':','.join([i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'[data-testid="modal-container"] [id$="-row-title"]')))])
If you would like to have a matrix with true / false you could manipulate your DataFrame:
df = pd.DataFrame(data)
df = df.explode('amenities')
pd.crosstab(df['title'],df['amenities']).ne(0).rename_axis(index='title',columns=None).reset_index()
Output:
title 32" HDTV 32" HDTV with standard cable 32" TV AC - split type ductless system Air conditioning Babysitter recommendations Backyard Baking sheet ... Unavailable: Kitchen Unavailable: Private entrance Unavailable: Security cameras on property Unavailable: Shampoo Unavailable: Smoke alarm Unavailable: TV Unavailable: Washer Washer Wifi Wine glasses
0 #SKGH Amaryllis luxury suite -NearHELEXPO False False False False False True False False False ... False True False False True False False True True False
1 8 NETFLIX BELGIUM HELLEXPO UNIVERSITY True False False False False True False False False ... True True True False False False True False True False
2 ASOPOO STUDIO True False False False False False False False False ... False True False False True False True False True False
...
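To see what the explode() / crosstab() combination does mechanically, here is a self-contained toy version (titles and amenities are made up):
import pandas as pd

# two listings with overlapping amenity lists (illustrative data)
df = pd.DataFrame({
    'title': ['Listing A', 'Listing B'],
    'amenities': [['Wifi', 'Kitchen'], ['Wifi', 'TV']],
})

df = df.explode('amenities')                               # one row per (title, amenity) pair
matrix = pd.crosstab(df['title'], df['amenities']).ne(0)   # occurrence counts -> booleans
print(matrix)
# amenities  Kitchen     TV  Wifi
# title
# Listing A     True  False  True
# Listing B    False   True  True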