I would like to handle the case where the site structure is messy: many elements share the same tags and classes. Within those classes, the information I need sits at a different index each time, sometimes [3], sometimes [6], and sometimes it is missing entirely. Suppose a card contains:
div class="data-123": Address
div class="data-123": Information
div class="data-123": Phone - 1234567890
and so on. The phone is what I need to get. In different cards it appears in a random position, or it may not exist at all. An XPath based on position will not find it because the position differs every time, and the selector is identical to that of the other fields. Could I match on the word "telephone" inside this class? How can I get out of this situation?
CodePudding user response:
To get company titles and telephones you can do:
import requests
from bs4 import BeautifulSoup

url = "https://www.ua-region.com.ua/ru/kved/49.41"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# each company card on the listing page
for company in soup.select(".cart-company-lg"):
    title = company.select_one(".cart-company-lg__title").text
    # phone links are the elements whose href starts with "tel"
    telephones = [t.text for t in company.select('[href^="tel"]')]
    print(title)
    print(*telephones)
    print("-" * 80)
Prints:
СПЕЦ-Ф-ТРАНС, ООО
38 (093) 7756...
...and so on.
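The same idea can be tried offline on a small inline document. This is only a sketch: the markup, the class names `card` and `card__title`, and the helper `extract_cards` are made up for illustration and are not taken from the real site. It shows that the `[href^="tel"]` selector does not depend on where the phone link sits inside the card, and that a card without a phone simply yields an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only (not the real site's HTML).
HTML = """
<div class="card">
  <div class="card__title">Alpha</div>
  <a href="/about">About</a>
  <a href="tel:+380001112233">+38 (000) 111-22-33</a>
</div>
<div class="card">
  <div class="card__title">Beta</div>
  <a href="mailto:beta@example.com">Mail</a>
</div>
"""


def extract_cards(html):
    """Return a list of (title, [phones]) tuples, one per card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select(".card"):
        title_tag = card.select_one(".card__title")
        title = title_tag.get_text(strip=True) if title_tag else "N/A"
        # Match by attribute prefix, so the link's position does not matter.
        phones = [a.get_text(strip=True) for a in card.select('a[href^="tel"]')]
        rows.append((title, phones))
    return rows


cards = extract_cards(HTML)
for title, phones in cards:
    print(title, phones)
```

The second card has no `tel:` link, so its phone list is empty rather than raising an error.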
EDIT: To get the name and telephones from a company page:
import requests
from bs4 import BeautifulSoup

url = "https://www.ua-region.com.ua/ru/43434454"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
sidebar = soup.select_one(".company-item-sidebar")

# the label "Юридическое название" means "Legal name"
name = sidebar.select_one('span:-soup-contains("Юридическое название")')
# find_next(string=True) returns the first text node after the start of the
# <span>, which can be the label's own text
name = name.find_next(string=True) if name else "N/A"

# the label "Телефоны" means "Telephones"
telephones = sidebar.select_one('span:-soup-contains("Телефоны")')
telephones = (
    [a["href"] for a in telephones.find_next("div").find_all("a")]
    if telephones
    else []
)

# get other items here
# ...

print(name)
print(telephones)
print()
Prints:
Юридическое название
['tel: 380636540215', 'tel: 380685496792', 'tel: 380952052509']
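For the layout described in the question (several `div` elements with the same class, the phone at a varying position or absent), a label-based lookup along the same lines might look like the sketch below. The inline HTML, the wrapping `card` class, the "Phone - digits" text format, and the helper `phone_from_card` are assumptions made for illustration. The `:-soup-contains` pseudo-class requires a reasonably recent Soup Sieve (2.1 or later).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question: same class everywhere,
# phone in a different position per card, or missing.
HTML = """
<div class="card">
  <div class="data-123">Address - 1 Example St</div>
  <div class="data-123">Information - open daily</div>
  <div class="data-123">Phone - 1234567890</div>
</div>
<div class="card">
  <div class="data-123">Phone - 5550001111</div>
  <div class="data-123">Address - 2 Example St</div>
</div>
<div class="card">
  <div class="data-123">Address - 3 Example St</div>
</div>
"""

# Assumed text format: the label, a dash, then digits.
PHONE_RE = re.compile(r"Phone\s*-\s*(\d+)")


def phone_from_card(card):
    """Return the phone digits from a card, or None if the card has none."""
    # Select by the label text instead of by index.
    block = card.select_one('div.data-123:-soup-contains("Phone")')
    if block is None:
        return None
    match = PHONE_RE.search(block.get_text())
    return match.group(1) if match else None


soup = BeautifulSoup(HTML, "html.parser")
phones = [phone_from_card(card) for card in soup.select("div.card")]
print(phones)
```

Because the lookup keys on the label text, the index of the block inside the card no longer matters, and a missing phone comes back as `None`.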