Scrape a phone number inside a popup button using Python BeautifulSoup

I want to scrape a hidden phone number from a website using BeautifulSoup.

On https://haraj.com.sa/1194697687 the phone number is hidden; it is only shown after you click the "التواصل" ("Contact") button.

Here is my code

import requests
from bs4 import BeautifulSoup

url = "https://haraj.com.sa/1199808969"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, features='lxml')

# attrs is a dict mapping attribute name to value
post = soup.find('span', {'class': 'contact'})

print(post)

And here is the output I got:

<span ><button  type="button"><img src="https://v8-cdn.haraj.com.sa/logos/contact_logo.svg" style="margin-left:5px;filter:brightness(0) invert(1)"/>التواصل</button></span>

CodePudding user response：

BeautifulSoup alone won't be enough for this, because it is only an HTML parser, and Selenium is overkill. The page uses JavaScript to populate the DOM dynamically and asynchronously when you press the button. If you request the page from Python and parse the HTML, you only see the bare-bones template that the browser would normally fill in later.

The data for the modal comes from a fetch/XHR HTTP POST request to a GraphQL API, which responds with JSON. If you log network traffic in your browser's developer tools while pressing the button, you can see the request URL, query-string parameters, POST payload, request headers, and so on. You just need to mimic that request in Python. Fortunately, this API seems fairly lenient, so you don't have to send every parameter the browser sends:

import sys

import requests


def main():
    url = "https://graphql.haraj.com.sa"

    # Query-string parameters the browser sends; the API appears to accept empty values.
    params = {
        "queryName": "postContact",
        "token": "",
        "clientId": "",
        "version": ""
    }

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    # GraphQL query that asks only for the contact text of a single post.
    payload = {
        "query": "query postContact($postId: Int!) {postContact(postId: $postId){contactText}}",
        "variables": {
            "postId": 94697687
        }
    }

    response = requests.post(url, params=params, headers=headers, json=payload)
    response.raise_for_status()

    print(response.json())

    return 0


if __name__ == "__main__":
    sys.exit(main())

Output:

{'data': {'postContact': {'contactText': '0562038953'}}}
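
To reuse this request for other posts, the payload construction and the response handling can be pulled into small helpers. The following is only a minimal sketch, not the site's documented API: the function names `build_contact_payload` and `extract_contact_text` are hypothetical, and it assumes the response keeps the shape shown in the output above. The `errors` check relies on the general GraphQL convention of reporting failures under a top-level `errors` key.

```python
# Same query string as in the request above.
CONTACT_QUERY = (
    "query postContact($postId: Int!) "
    "{postContact(postId: $postId){contactText}}"
)


def build_contact_payload(post_id):
    """Build the JSON body for the postContact query (hypothetical helper)."""
    # The query declares $postId as Int!, so coerce strings such as "94697687".
    return {"query": CONTACT_QUERY, "variables": {"postId": int(post_id)}}


def extract_contact_text(body):
    """Return the phone number from a decoded response, or None if absent."""
    # GraphQL servers conventionally report failures under a top-level "errors" key.
    if body.get("errors"):
        raise ValueError(f"GraphQL error: {body['errors']}")
    # Assumed shape: {"data": {"postContact": {"contactText": "..."}}}
    post_contact = (body.get("data") or {}).get("postContact") or {}
    return post_contact.get("contactText")
```

With these in place, the call in `main()` could become `requests.post(url, params=params, headers=headers, json=build_contact_payload(94697687))`, followed by `extract_contact_text(response.json())` to get just the number.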

CodePudding user response：

I was able to do this using Selenium and ChromeDriver (https://chromedriver.chromium.org/downloads). Just be sure to change the path to wherever you extracted chromedriver, and install Selenium via pip:

pip install selenium

main.py

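from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time

url = "https://haraj.com.sa/1199808969"

def main():
    print(get_value())


def get_value():
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    # Raw string so the backslashes in the Windows path are not treated as escapes.
    service = Service(r"C:\Developement\chromedriver.exe")
    # Selenium 4 takes the driver path via Service and the options via options=.
    driver = webdriver.Chrome(service=service, options=chrome_options)
    driver.get(url)
    # Click the contact button to open the modal.
    driver.find_element(By.CLASS_NAME, "AGAbw").click()
    # Give the modal time to load its content.
    time.sleep(5)
    val = driver.find_element(By.XPATH, '//*[@id="modal"]/div/div/a[2]/div[2]').text
    driver.quit()
    return val

main()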

Output:

[0829/155029.109:INFO:CONSOLE(1)] "HBM Loaded", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155030.571:INFO:CONSOLE(1)] "[object Object]", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155030.604:INFO:CONSOLE(1)] "[object Object]", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155031.143:INFO:CONSOLE(16)] "Yay! SW loaded 🎉", source: https://haraj.com.sa/sw.js (16)
0559559838