I am trying to get a link from a webpage's HTML in Python. The problem is that when I open the Chrome inspect tool, I can see that the link looks like this:
<meta property="og:image" content="https://dkstatics-public.digikala.com/digikala-products/09edc9a95239bbd46cbf5d2f344fc45620166666_1620816530.jpg?x-oss-process=image/resize,m_lfit,h_350,w_350/quality,q_60">
But when I fetch the HTML using this code, this line doesn't exist in the result.
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.request import Request, urlopen
import re
url = 'https://www.digikala.com/product/dkp-729879/شورت-مردانه-آریان-نخ-باف-کد-1312-مجموعه-3-عددی/'
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
req = Request(url,headers=headers)
webpage = str(urlopen(req).read())
print(webpage)
The HTML that I get from Python seems to be a lot shorter and doesn't contain this element at all. What I want to know is: how can I get that element through Python?
CodePudding user response:
The tags you see are created programmatically by JavaScript, which loads the data from a different URL. This example uses the requests and json modules to get the data:
import re
import json
import requests
url = "https://www.digikala.com/product/dkp-729879/شورت-مردانه-آریان-نخ-باف-کد-1312-مجموعه-3-عددی/"
# extract the numeric product id (the digits between "-" and "/")
id_ = re.search(r"-(\d+)/", url).group(1)
product_url = f"https://api.digikala.com/v1/product/{id_}/"
data = requests.get(product_url).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
print(data["data"]["seo"]["open_graph"]["image"])
Prints:
https://dkstatics-public.digikala.com/digikala-products/113697641.jpg?x-oss-process=image/resize,m_lfit,h_350,w_350/quality,q_60
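A slightly more defensive variant of the same idea, as a sketch only: it assumes the product URL has the `/product/dkp-<id>/` shape and that the JSON layout (`data`, `seo`, `open_graph`, `image`) is as shown above, either of which may change. The helper names `product_id_from_url` and `og_image_from_payload` are made up for illustration, and the sample payload is hand-written, not real API output.

```python
import re


def product_id_from_url(url):
    """Return the numeric id from a '/dkp-<id>/' product URL, or None if absent."""
    match = re.search(r"/dkp-(\d+)/", url)
    return match.group(1) if match else None


def og_image_from_payload(payload):
    """Walk data -> seo -> open_graph -> image; return None if any key is missing."""
    node = payload
    for key in ("data", "seo", "open_graph", "image"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Offline demonstration with hand-written inputs (no network access needed).
sample_url = "https://www.digikala.com/product/dkp-729879/some-slug/"
sample_payload = {"data": {"seo": {"open_graph": {"image": "https://example.com/a.jpg"}}}}

print(product_id_from_url(sample_url))
print(og_image_from_payload(sample_payload))
print(og_image_from_payload({"data": {}}))
```

Against the live site, the payload would come from `requests.get(product_url).json()` as in the answer's code; a `None` result then signals that the response no longer has the expected shape.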
CodePudding user response:
The page is rendered with JavaScript, so the HTML returned by the request has to be evaluated by a JavaScript engine, like a browser does.
Try running Splash in Docker and sending the request through it, and it should work. Alternatively, use another tool that works similarly.
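As a rough sketch of the Splash route: this assumes a Splash instance is already running locally on port 8050 and uses its `render.html` HTTP endpoint with the `url` and `wait` parameters. The helper names, the port, and the `wait` value are assumptions chosen for illustration; the meta tag is parsed with the standard library's `html.parser`, and the demonstration runs on a hand-written snippet so it needs no network.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class OgImageParser(HTMLParser):
    """Collect the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if (
            tag == "meta"
            and attr_map.get("property") == "og:image"
            and self.og_image is None
        ):
            self.og_image = attr_map.get("content")


def extract_og_image(html):
    """Return the og:image URL found in the HTML text, or None."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image


def splash_render_url(page_url, splash_base="http://localhost:8050", wait=2):
    """Build a render.html request URL for an assumed local Splash instance."""
    return f"{splash_base}/render.html?" + urlencode({"url": page_url, "wait": wait})


# Offline check on a hand-written snippet; Splash is not needed for this part.
sample_html = (
    '<html><head><meta property="og:image" '
    'content="https://example.com/a.jpg"></head></html>'
)
print(extract_og_image(sample_html))
print(splash_render_url("https://example.com/p/1/"))
```

With Splash running, something like `requests.get(splash_render_url(url)).text` should return the rendered HTML, which can then be passed to `extract_og_image`.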