Scraping hotel info using an existing list of URLs in a CSV file

I have scraped the URLs of 3 hotel information pages from TripAdvisor and stored them in a CSV file. After importing the CSV file, I need to use these 3 URLs to scrape each hotel's name, price range and hotel class. I am using Selenium.

Name	Link
The Upper House	https://en.tripadvisor.com.hk/Hotel_Review-g294217-d1513860-Reviews-The_Upper_House-Hong_Kong.html
Hotel ICON	https://en.tripadvisor.com.hk/Hotel_Review-g294217-d2031570-Reviews-Hotel_ICON-Hong_Kong.html
W Hong Kong	https://en.tripadvisor.com.hk/Hotel_Review-g294217-d1068719-Reviews-W_Hong_Kong-Hong_Kong.html

Here is my code. With the URL of a single hotel, I can scrape the hotel name. However, when I try to scrape several hotels, it doesn't work. The problem seems to be in the "for" loop.

!pip install selenium

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import csv
from time import sleep
from time import time
from random import randint

browser = webdriver.Chrome(executable_path=r'C:\ProgramData\Anaconda3\Lib\site-packages\jupyterlab\chromedriver.exe')
result_list=[]

def start_request(q):
   r = browser.get(q)
   print("crawling " + q)
   return r

def parse(text):
   container1 = browser.find_elements_by_xpath('//*[@id="taplc_hotel_review_atf_hotel_info_web_component_0"]')
   mydict = {}

   for results in container1:
        try:
            mydict['name'] = results.find_element_by_xpath('//*[@id="HEADING"]')

        except Exception as e:
            print(e)
            print('not____________________________found')
            mydict['name'] = 'null'
            result_list.append(mydict)

with open('Best3HotelsLink.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
          req = row['Link']
          text = start_request(req)
          parse(text)
          sleep(randint(1,3))

import pandas as pd
df = pd.DataFrame(result_list)
df.to_csv('Detailed Hotelinfo.csv')
df

I have also tried to scrape the hotel class and the price range of each hotel, but without success.

I would like to seek your advice on how to fix the above problems. Many thanks.

CodePudding user response：

If you have a lot of information to scrape, I suggest locating the elements again on each iteration instead of reusing the first result.

Try this code:

def parse(text):
   sleep(2)   # wait for the page to load; `sleep` is already imported from `time` above
   container1 = browser.find_elements_by_xpath('//*[@id="taplc_hotel_review_atf_hotel_info_web_component_0"]')
   nbrcontainer = len(container1)

   for i in range(0, nbrcontainer):
        # locate the containers again on every iteration so the reference is fresh
        container1 = browser.find_elements_by_xpath('//*[@id="taplc_hotel_review_atf_hotel_info_web_component_0"]')
        results = container1[i]
        mydict = {}   # a new dict per result, so rows do not overwrite each other
        try:
            # .text stores the heading string rather than the WebElement object
            mydict['name'] = results.find_element_by_xpath('//*[@id="HEADING"]').text
        except Exception as e:
            print(e)
            print('not____________________________found')
            mydict['name'] = 'null'
        result_list.append(mydict)   # append whether or not the name was found
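
For comparison, here is a minimal sketch of the same idea with the CSV reading and the per-URL bookkeeping pulled into small functions. The names `read_links`, `scrape_name` and `scrape_all` are made up for illustration, and `browser` is assumed to be a Selenium WebDriver like the one in the question. It uses the generic `find_element(by, value)` call because newer Selenium 4 releases no longer provide the `find_element_by_*` helpers.

```python
import csv
from time import sleep

HEADING_ID = "HEADING"  # element id taken from the XPath in the question


def read_links(csv_file):
    """Return the 'Link' column of an open CSV file as a list of URLs."""
    return [row["Link"] for row in csv.DictReader(csv_file)]


def scrape_name(browser, url, pause=2.0):
    """Load one hotel page and return a dict with its URL and name."""
    browser.get(url)
    sleep(pause)  # crude fixed wait for the page to render
    record = {"url": url, "name": None}
    try:
        # "id" is the string value of Selenium's By.ID locator constant
        record["name"] = browser.find_element("id", HEADING_ID).text
    except Exception as exc:  # e.g. the heading is missing from the page
        print(f"name not found for {url}: {exc}")
    return record


def scrape_all(browser, urls, pause=2.0):
    """Return one record per URL, whether or not the name was found."""
    return [scrape_name(browser, url, pause) for url in urls]
```

The list returned by `scrape_all` holds plain dicts, so it can be passed to `pd.DataFrame` the same way as `result_list` in the question.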

CodePudding user response：

I'm no good with Selenium, so here's how to get the price range and hotel class with BeautifulSoup. Both are inside different divs with the same id (...), so they're hard to scrape. I don't think Selenium can handle the first selector, since `:-soup-contains` is a BeautifulSoup (Soup Sieve) extension rather than standard CSS, but the second should work.

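from bs4 import BeautifulSoup

# html_data is the page HTML, e.g. browser.page_source when the page was loaded with Selenium
soup = BeautifulSoup(html_data, 'lxml')
price_range = soup.select_one('div:-soup-contains("PRICE RANGE") + div').text
hotel_class = soup.select_one('#ABOUT_TAB svg[title*="bubbles"]')['title']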

They have an API, which may be worth using if you have a lot of scraping to do on this site. The site's HTML is so messy that I think it's already worth it, but that's just my opinion.
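
If the `title` string from the second selector needs to be a number (for sorting or filtering, say), it can be parsed afterwards. This is only a sketch: the helper name `parse_bubbles` is made up, and it assumes the attribute starts with the value, as in `4.5 of 5 bubbles`. That format is a guess based on the `bubbles` substring in the selector and has not been checked against the live page.

```python
import re


def parse_bubbles(title):
    """Return the leading number of a title such as '4.5 of 5 bubbles', or None."""
    # Assumed format: the numeric value comes first in the attribute text.
    match = re.match(r"\s*(\d+(?:\.\d+)?)", title or "")
    return float(match.group(1)) if match else None
```

Returning `None` instead of raising keeps one odd page from stopping a long scraping run.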