Scrape data after an element click (or an automatic href link click) on a web page during web scraping

Time:02-17

I want to scrape data that only becomes available after clicking an element (or automatically following an href link) on a web page. Please note that there is no XPath to click. Please guide me; I am new to clicking individual elements.

https://www.yelp.com/search?find_desc=Gastroenterologist&find_loc=Houston, TX 77002 - I am able to scrape this link, but I do not know how to scrape the individual elements and tags. Please guide me with reference code; any other method is also fine. Thanks in advance.

Individual link - https://www.yelp.com/biz/john-clemmons-jr-md-houston?osq=Gastroenterologist

# Required outputs:
#   1. phone number - (713) 526-4263
#   2. address      - 1200 Binz St Ste 1025 Park Plaza Medical Associates Houston, TX 77004
#   3. web address  - http://www.Parkplazamed.com
#
# format = [phone_number1, phone_number2, ...]
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0'}
web_page = 'https://www.yelp.com/search?find_desc=Gastroenterologist&find_loc=Houston, TX 77002&ns=1'

with requests.Session() as session:
    # Fetch the search results page and raise on HTTP errors.
    # (The earlier request to HOST was dropped: HOST is commented out, so it raised a NameError.)
    r = session.get(web_page, headers=headers)
    r.raise_for_status()
    # process content from here

soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())

# Links to the individual results on the search page
doctor_links = soup.find_all('a', attrs={'class': 'css-1422juy'})

doctor_n = []

for title in doctor_links:
    doctor_n.append(title.text.strip())
print(doctor_n)

CodePudding user response:

To get the local business data, you can parse the JSON-LD data embedded in the page. For example:

import json
import requests
from bs4 import BeautifulSoup


url = "https://www.yelp.com/biz/john-clemmons-jr-md-houston?osq=Gastroenterologist"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = {}
for d in soup.select('[type="application/ld+json"]'):
    d = json.loads(d.contents[0])
    data[d["@type"]] = d


print(data["LocalBusiness"]["name"])
print(data["LocalBusiness"]["telephone"])
print(data["LocalBusiness"]["address"])

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

Prints:

John Clemmons Jr, MD
(713) 528-6562
{
    "streetAddress": "1213 Hermann Dr\nSte 420",
    "addressLocality": "Houston",
    "addressCountry": "US",
    "addressRegion": "TX",
    "postalCode": "77004",
}
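
To go from the search page to each individual page, one option is to collect the result links first and then run the same JSON-LD parsing on every page, with no clicking involved. The following is only a rough sketch under some assumptions: the helper names `business_links` and `business_details` are made up for illustration, it assumes that result links on the search page are relative hrefs starting with `/biz/`, and it assumes every business page embeds a `LocalBusiness` JSON-LD block like the one above. It uses only the standard library and works on HTML strings, so it can be tried without a network connection.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class _PageParser(HTMLParser):
    """Collects /biz/ hrefs and the text of JSON-LD <script> blocks."""

    def __init__(self):
        super().__init__()
        self.biz_hrefs = []
        self.jsonld_blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and (attrs.get("href") or "").startswith("/biz/"):
            self.biz_hrefs.append(attrs["href"])
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self.jsonld_blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.jsonld_blocks[-1] += data


def business_links(search_html, base="https://www.yelp.com"):
    """Return unique absolute business URLs found in a search page.

    Assumption: result links are relative hrefs that start with /biz/.
    """
    parser = _PageParser()
    parser.feed(search_html)
    links = []
    for href in parser.biz_hrefs:
        url = urljoin(base, href.split("?")[0])  # drop the query string
        if url not in links:
            links.append(url)
    return links


def business_details(biz_html):
    """Return name, phone and a one-line address from a business page.

    Assumption: the page embeds a LocalBusiness JSON-LD block.
    Returns None when no such block is found.
    """
    parser = _PageParser()
    parser.feed(biz_html)
    for block in parser.jsonld_blocks:
        try:
            d = json.loads(block)
        except ValueError:
            continue  # skip blocks that are not valid JSON
        if isinstance(d, dict) and d.get("@type") == "LocalBusiness":
            addr = d.get("address") or {}
            parts = [
                addr.get("streetAddress", "").replace("\n", " "),
                addr.get("addressLocality", ""),
                addr.get("addressRegion", ""),
                addr.get("postalCode", ""),
            ]
            return {
                "name": d.get("name"),
                "phone": d.get("telephone"),
                "address": ", ".join(p for p in parts if p),
            }
    return None
```

To use it, fetch the search page with `requests` as in the question, pass `r.text` to `business_links`, then request each returned URL and pass its text to `business_details`. The web address is not covered here; whether it appears in the embedded data would need to be checked with the `json.dumps` line above.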