I want to scrape a hidden phone number from a website using beautifulsoup
https://haraj.com.sa/1194697687, as you can see in this link the phone number is hidden, and it only showed when you click "التواصل" button
Here is my code
from bs4 import BeautifulSoup
url = "https://haraj.com.sa/1199808969"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36.'}
r = requests.get(url,headers=headers)
soup = BeautifulSoup(r.content,features='lxml')
post = soup.find('span', {'class', 'contact'})
print(post)
and here is the output I got
<span ><button type="button"><img src="https://v8-cdn.haraj.com.sa/logos/contact_logo.svg" style="margin-left:5px;filter:brightness(0) invert(1)"/>التواصل</button></span>
CodePudding user response:
BeautifulSoup won't be enough for what you're trying to do - it's just an HTML parser. And Selenium is overkill. The page you're trying to scrape from uses JavaScript to dynamically and asynchronously populate the DOM with content when you press the button. If you make a request to that page in Python, and try to parse the HTML, you're only looking at the barebones template, which would normally get populated later on by the browser. The data for the modal comes from a fetch/XHR HTTP POST request to a GraphQL API, the response of which is JSON. If you use your browser's developer tools to log your network traffic when you press the button, you can see the HTTP request URL, query-string parameters, POST payload, request headers, etc. You just need to mimic that request in Python - fortunately this API seems to be pretty lenient, so you won't have to provide all the same parameters that the browser provides:
def main():
import requests
url = "https://graphql.haraj.com.sa"
params = {
"queryName": "postContact",
"token": "",
"clientId": "",
"version": ""
}
headers = {
"user-agent": "Mozilla/5.0"
}
payload = {
"query": "query postContact($postId: Int!) {postContact(postId: $postId){contactText}}",
"variables": {
"postId": 94697687
}
}
response = requests.post(url, params=params, headers=headers, json=payload);
response.raise_for_status()
print(response.json())
return 0
if __name__ == "__main__":
import sys
sys.exit(main())
Output:
{'data': {'postContact': {'contactText': '0562038953'}}}
CodePudding user response:
I was able to do this using selenium and chromedriver https://chromedriver.chromium.org/downloads just be sure to change the path to where you extract chromedriver and install selenium via pip;
pip install selenium
main.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time
url = "https://haraj.com.sa/1199808969"
def main():
print(get_value())
def get_value():
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome("C:\Developement\chromedriver.exe",chrome_options=chrome_options)
driver.get(url)
driver.find_element(By.CLASS_NAME, "AGAbw").click()
time.sleep(5)
val = driver.find_element(By.XPATH, '//*[@id="modal"]/div/div/a[2]/div[2]').text
driver.quit()
return val
main()
Output:
[0829/155029.109:INFO:CONSOLE(1)] "HBM Loaded", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155030.571:INFO:CONSOLE(1)] "[object Object]", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155030.604:INFO:CONSOLE(1)] "[object Object]", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155031.143:INFO:CONSOLE(16)] "Yay! SW loaded 🎉", source: https://haraj.com.sa/sw.js (16)
0559559838