Hi guys, I am trying to scrape some data from Airbnb in order to build a small data analysis project for my portfolio.
I tried several tutorials with BeautifulSoup, but none of them works today, even when I use the very same link that the tutorials use.
Because of this I turned to Selenium. I managed to open the site, and as a first step I am trying to extract the listing names. After that I would like to extract all the other information (price, reviews, rating, amenities, etc.).
My code is below, but I am getting an empty list. Can anyone help me get the name of the apartment?
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

website = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=unknown'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(website)
titles = driver.find_elements("class name", "n1v28t5c s1cjsi4j dir dir-ltr")  # this comes back as an empty list
Thanks.
CodePudding user response:
Selenium combined with bs4 works fine for this page and returns the right data. Just run the code.
Example:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time

url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=unknown&price_filter_input_type=0&federated_search_session_id=6c89837f-b442-4b3b-bf1b-4d2a659c0000&pagination_search=true&items_offset=40&section_offset=2'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
driver.maximize_window()
time.sleep(5)  # give the JavaScript-rendered listings time to load
soup = BeautifulSoup(driver.page_source, 'lxml')
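# The attribute filters between the [] are missing here; each selector needs one before this loop will run.
for card in soup.select('div[]'):
    title = card.select_one('div[]').text
    price = card.select_one('span[]').text
    print(title, price)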
Output:
Apartment in Thessaloniki $39 per night
Apartment in Thessaloniki $37 per night
Apartment in Thessaloniki $39 per night
Apartment in Thessaloniki $41 per night
Apartment in Thessaloniki $41 per night
Apartment in Thessaloniki $27 per night
Apartment in Thessaloniki $37 per night
Condo in Thessaloniki $34 per night
Home in Ana Polis $31 per night
Apartment in Ana Polis $34 per night
Condo in Thessaloniki $65 per night
Apartment in Thessaloniki $65 per night
Apartment in Thessaloniki $46 per night
Apartment in Thessaloniki $27 per night
Apartment in Ladadika $50 per night
Apartment in Ana Polis $71 per night
Apartment in Thessaloniki $24 per night
Condo in Thessaloniki $45 per night
Condo in Thessaloniki $40 per night
Apartment in Ladadika $57 per night
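Since the goal is a data analysis project, the printed pairs are more useful as structured rows. Here is a minimal sketch, assuming the price text always has the shape seen in the output above ("$39 per night"). The helper names parse_price and to_row are hypothetical and not part of the answer's code:

```python
import re

# Assumed format, taken from the sample output: "$39 per night".
PRICE_RE = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)")


def parse_price(text):
    """Return the nightly price as a float, or None if no dollar amount is found."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def to_row(title, price_text):
    """Build one record from the title and price strings of a listing card."""
    return {"title": title.strip(), "price_per_night": parse_price(price_text)}


rows = [
    to_row("Apartment in Thessaloniki", "$39 per night"),
    to_row("Home in Ana Polis", "$31 per night"),
]
print(rows)
```

Inside the scraping loop, rows.append(to_row(title, price)) could take the place of the print call, and pd.DataFrame(rows) would then give a table to analyze.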
CodePudding user response:
To extract the names of the properties, use WebDriverWait with visibility_of_all_elements_located() so that the elements are rendered before you read them. You can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=unknown')
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id^='title']")))])
Using XPATH:
driver.get('https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki, Greece&date_picker_type=calendar&search_type=unknown')
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[starts-with(@id, 'title') and text()]")))])
Console Output:
['Flat in Thessaloniki', 'Apartment in Thessaloniki', 'Flat in Thessaloniki', 'Apartment in Thessaloniki', 'Apartment in Thessaloniki', 'Loft in Thessaloniki', 'Flat in Thessaloniki', 'Flat in Thessaloniki', 'Apartment in Thessaloniki', 'Apartment in Thessaloniki', 'Flat in Thessaloniki', 'Flat in Thessaloniki', 'Apartment in Thessaloniki', 'Flat in Thessaloniki', 'Apartment in Thessaloniki', 'Apartment in Thessaloniki', 'Flat in Thessaloniki', 'Flat in Thessaloniki', 'Flat in Thessaloniki', 'Apartment in Agios Pavlos']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
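For context on why the explicit wait matters: WebDriverWait keeps calling the condition until it returns something truthy or the timeout expires, which gives the JavaScript-rendered listings time to appear. The following is a simplified pure-Python sketch of that polling idea, not Selenium's actual implementation. wait_until and its parameters are hypothetical names used only for illustration:

```python
import time


def wait_until(condition, timeout=20.0, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_frequency)


# Simulated page: the list of titles is empty for the first two polls.
polls = {"count": 0}


def titles_when_rendered():
    polls["count"] += 1
    return ["Flat in Thessaloniki"] if polls["count"] >= 3 else []


print(wait_until(titles_when_rendered, timeout=2.0, poll_frequency=0.01))
```

Compared with a fixed time.sleep(5), polling returns as soon as the condition holds and fails loudly when it never does, instead of silently handing back an empty list.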
CodePudding user response:
driver.find_elements("class name", "n1v28t5c s1cjsi4j dir dir-ltr")
This will return 0 elements. By.CLASS_NAME can only find elements based on a single class, and "n1v28t5c s1cjsi4j dir dir-ltr" is actually 4 separate classes of the element you're trying to locate. You can locate elements with multiple classes using, for example, XPath selectors:
driver.find_elements(By.XPATH, '//div[@]')
This will find all 20 elements on the page. I strongly encourage you to learn more about XPath, as it's pretty simple to understand and very powerful.
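An alternative to matching the full class string is to require each class individually, which keeps working if the site reorders the classes. Here is a minimal sketch: css_for_classes, xpath_for_classes and find_by_classes are hypothetical helpers, the div tag is an assumption, and the class names are the ones from the question, which Airbnb may have changed since:

```python
def css_for_classes(class_attr, tag="div"):
    """Turn 'a b c' into the CSS selector 'div.a.b.c' (every class required)."""
    return tag + "".join("." + name for name in class_attr.split())


def xpath_for_classes(class_attr, tag="div"):
    """Build an XPath that requires every class, in any order."""
    tests = [
        f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"
        for name in class_attr.split()
    ]
    return f"//{tag}[" + " and ".join(tests) + "]"


def find_by_classes(driver, class_attr):
    """Locate elements carrying all the given classes with a Selenium driver."""
    from selenium.webdriver.common.by import By  # imported lazily; needs selenium installed

    return driver.find_elements(By.CSS_SELECTOR, css_for_classes(class_attr))


classes = "n1v28t5c s1cjsi4j dir dir-ltr"
print(css_for_classes(classes))
print(xpath_for_classes(classes))
```

With a live driver, find_by_classes(driver, classes) would be the call to try. Generated class names like these tend to change between site releases, so a more stable attribute is usually a safer anchor when one is available.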