How can I scrape data that only appears after clicking an element (or following an href link) on a web page? Note that there is no XPath to click. Please guide me; I am new to handling clicks on individual elements.
https://www.yelp.com/search?find_desc=Gastroenterologist&find_loc=Houston, TX 77002 - I am able to scrape this search page, but I do not know how to scrape the individual elements and tags behind each result. Please guide me with reference code; any other method is also fine. Thanks in advance.
Individual link - https://www.yelp.com/biz/john-clemmons-jr-md-houston?osq=Gastroenterologist
# Required outputs:
# 1. phone number - (713) 526-4263
# 2. address - 1200 Binz St Ste 1025 Park Plaza Medical Associates Houston, TX 77004
# 3. web address - http://www.Parkplazamed.com
# format = [phone_number1, phone_number2, ...]
from bs4 import BeautifulSoup
from csv import writer
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0'
}
# HOST = 'https://www.zocdoc.com'
# PAGE = 'gastroenterologists/2'
web_page = 'https://www.yelp.com/search?find_desc=Gastroenterologist&find_loc=Houston, TX 77002&ns=1'

with requests.Session() as session:
    # (r := session.get(f'{HOST}/{PAGE}', headers=headers)).raise_for_status()
    (r := session.get(web_page, headers=headers)).raise_for_status()

# process content from here
print(r.text)
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())

# Result-title links; this class name looks auto-generated, so it may change
movies_html = soup.find_all('a', attrs={'class': 'css-1422juy'})
doctor_n = []
for title in movies_html:
    doctor_n.append(title.text.strip())
print(doctor_n)
CodePudding user response:
To get the data of the local business, you can parse the JSON-LD data embedded in the page. For example:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.yelp.com/biz/john-clemmons-jr-md-houston?osq=Gastroenterologist"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = {}
for d in soup.select('[type="application/ld+json"]'):
    d = json.loads(d.contents[0])
    data[d["@type"]] = d
print(data["LocalBusiness"]["name"])
print(data["LocalBusiness"]["telephone"])
print(data["LocalBusiness"]["address"])
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
Prints:
John Clemmons Jr, MD
(713) 528-6562
{
    "streetAddress": "1213 Hermann Dr\nSte 420",
    "addressLocality": "Houston",
    "addressCountry": "US",
    "addressRegion": "TX",
    "postalCode": "77004",
}
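
To get from the search page to each individual page, no click should be needed: result links are ordinary href attributes that can be collected and requested one by one. Below is a minimal sketch of that idea, not a definitive implementation. It assumes that result links on the search page are relative hrefs starting with /biz/ and that each business page embeds a LocalBusiness JSON-LD block as in the example above; neither is guaranteed to stay true. The function names and the fetch parameter are hypothetical helpers introduced for illustration, not part of any library.

```python
import json
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.yelp.com"


def extract_biz_links(search_html, base_url=BASE_URL):
    """Return absolute, de-duplicated URLs of all /biz/ links in a search page."""
    soup = BeautifulSoup(search_html, "html.parser")
    links = []
    # Assumption: result links are relative hrefs that start with /biz/
    for a in soup.select('a[href^="/biz/"]'):
        url = urljoin(base_url, a["href"]).split("?")[0]  # drop the query string
        if url not in links:
            links.append(url)
    return links


def extract_local_business(page_html):
    """Return the LocalBusiness JSON-LD object embedded in a page, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    for tag in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # skip empty or malformed blocks
        # Only top-level JSON objects are handled in this sketch
        if isinstance(data, dict) and data.get("@type") == "LocalBusiness":
            return data
    return None


def format_address(address):
    """Join the parts of a JSON-LD address dict into a single line."""
    region_zip = " ".join(
        p for p in (address.get("addressRegion"), address.get("postalCode")) if p
    )
    parts = [
        address.get("streetAddress", "").replace("\n", " "),
        address.get("addressLocality", ""),
        region_zip,
    ]
    return ", ".join(p for p in parts if p)


def collect_phone_numbers(search_html, fetch):
    """Follow every /biz/ link and return [phone_number1, phone_number2, ...].

    fetch is any callable that takes a URL and returns that page's HTML.
    """
    phones = []
    for url in extract_biz_links(search_html):
        business = extract_local_business(fetch(url))
        if business and business.get("telephone"):
            phones.append(business["telephone"])
    return phones
```

With requests, fetch could be something like lambda url: session.get(url, headers=headers).text, and search_html would be the r.text of the search request. Passing the fetcher in as a parameter keeps the parsing logic testable without network access and makes it easy to add a delay between requests.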