Web Scraping TripAdvisor

Time:03-23

I just wanted to scrape the titles of the museums in Moscow and tried this code:

import requests
from bs4 import BeautifulSoup

for offset in range(0, 726, 30):
    print('--- page offset:', offset, '---')

    url = 'https://www.tripadvisor.ru/Attractions-g298484-Activities-c49' + str(offset) + '-Moscow_Central_Russia.html#EATERY_LIST_CONTENTS'

    r = requests.get(url, timeout=10, headers={'User-Agent': 'some cool user-agent'})
    soup = BeautifulSoup(r.text, "html.parser")

    for link in soup.findAll('a', {'title'}):
        print(link.text.strip())

But nothing happened :( I would be thankful for any advice!

CodePudding user response:

There's an error in your URL (the -oa part is missing), and your soup query doesn't match the HTML structure of the page (the a elements don't have a title attribute).

import re

import requests
from bs4 import BeautifulSoup

# Step through the result pages 30 entries at a time via the -oa<offset> URL part.
for offset in range(0, 726, 30):
    url = 'https://www.tripadvisor.ru/Attractions-g298484-Activities-c49-t161-oa' + str(offset) + '-Moscow_Central_Russia.html'
    r = requests.get(url, timeout=10, headers={'User-Agent': 'some cool user-agent'})
    soup = BeautifulSoup(r.text, "html.parser")
    for link in soup.find_all('a'):
        text = link.text.strip()
        # Keep only link texts that start with a rank number and a dot, e.g. "1. ...".
        if re.match('[0-9]+\\.', text):
            print(text)

CodePudding user response:

What happens?

The main issue is that the generated URLs do not point to the right pages because some characters are missing, so the pages come back empty. This may happen because the URL structure seems to be dynamic. The second issue is that your findAll() selection is not finding anything.

Note: In new code, use find_all() instead of the old findAll() syntax.
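
As a small offline illustration of the newer method name, here is a sketch that runs find_all() on a made-up HTML snippet (the markup is an assumption for demonstration, not TripAdvisor's actual page). Passing title=True keeps only the anchors that really carry a title attribute, which is probably what the original query was aiming for:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for demonstration only; not copied from TripAdvisor.
html = """
<a href="/a" title="First">one</a>
<a href="/b">two</a>
<a href="/c" title="Third">three</a>
"""

soup = BeautifulSoup(html, "html.parser")

# title=True matches any <a> that has a title attribute, whatever its value.
with_title = [a["title"] for a in soup.find_all("a", title=True)]
print(with_title)  # ['First', 'Third']
```

If the list comes back empty on the real page, the anchors there simply have no title attribute and a different selector is needed.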

How to fix?

First, start with a base link and take every following link from the Next page button; working with ranges is only the second-best solution.

soup.select_one('[aria-label="Next page"]')
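
To see how that selector behaves, here is a minimal sketch on a made-up pager snippet. The markup, the base_url value, and the page file names are assumptions for illustration only. It uses urljoin from the standard library to turn the relative href into an absolute URL:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical pager markup and URLs; the real page may look different.
html = '<a aria-label="Next page" href="/Attractions-page2.html">Next</a>'
base_url = "https://www.tripadvisor.ru/Attractions-page1.html"

soup = BeautifulSoup(html, "html.parser")
next_link = soup.select_one('[aria-label="Next page"]')

# select_one() returns None when nothing matches, which signals the last page.
next_url = urljoin(base_url, next_link["href"]) if next_link else None
print(next_url)  # https://www.tripadvisor.ru/Attractions-page2.html
```

Checking for None before reading href is what lets a crawl loop stop cleanly once there is no further page.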

Second, select your elements more specifically:

soup.select('a:has(h3)')
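
A short offline sketch of what this selector picks up, again on invented markup (the links and museum names are placeholders, not real page content). Only anchors that contain an h3 element are returned, so unrelated links are skipped:

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup for demonstration only.
html = """
<a href="/museum-1"><h3>1. Museum A</h3></a>
<a href="/reviews">Read reviews</a>
<a href="/museum-2"><h3>2. Museum B</h3></a>
"""

soup = BeautifulSoup(html, "html.parser")

# a:has(h3) selects every <a> that has an <h3> somewhere inside it.
titles = [a.get_text(strip=True) for a in soup.select("a:has(h3)")]
print(titles)  # ['1. Museum A', '2. Museum B']
```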

Example

import requests
from bs4 import BeautifulSoup

url = 'https://www.tripadvisor.ru/Attractions-g298484-Activities-c49-t161-Moscow_Central_Russia.html#EATERY_LIST_CONTENTS'

while True:
    
    r = requests.get(url, timeout=10, headers={'User-Agent': 'some cool user-agent'})
    soup = BeautifulSoup(r.text, "html.parser")
    
    for link in soup.select('a:has(h3)'):
        print(link.text.strip())
    
    if (a := soup.select_one('[aria-label="Next page"]')):
        url = 'https://www.tripadvisor.ru' + a['href']
    else:
        break

Output

1. Московский Кремль
2. Царицыно Музей-Заповедник
3. Московский Государственный Объединенный Музей-Заповедник "Коломенское"
4. Музей советских игровых автоматов
5. Еврейский музей и центр толерантности
6. Усадьба Кусково
7. Музей "Московский транспорт"
8. Музей Михаила Булгакова
9. Парк Музеон
...