I have the same site with same json structure. It works for one but doesn't work for second link. What is the reason? What is the logic of "group(1)"
Code
import requests, re, json
r1 = requests.get('https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132').text
r2 = requests.get('https://www.trendyol.com/lenovo/v15-82nb003gtx-i5-10210u-8gb-512gb-mx330-ssd-15-6-fhd-windows-10-home-dizustu-bilgisayar-p-320147814').text
json_data1 = json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r1).group(1))
print('json_data1:',json_data1['product']['attributes'][0]['value']['name'])
print('-------')
json_data2 = json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r2).group(1))
print('json_data2:',json_data2['product']['attributes'][0]['value']['name'])
Output
json_data1: Apple M1
-------
Traceback (most recent call last):
File "/Users/cucal/Desktop/scraper.py", line 33, in <module>
json_data2 = json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r2).group(1))
AttributeError: 'NoneType' object has no attribute 'group'
[Finished in 453ms with exit code 1]
[cmd: ['python3', '-u', '/Users/cucal/Desktop/scraper.py']]
[dir: /Users/cucal/Desktop]
[path: /opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/Apple/usr/bin:/Applications/Postgres.app/Contents/Versions/latest/bin]
CodePudding user response:
Did you see the diffence,
In [1]: re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r1)
Out[1]: <re.Match object; span=(136194, 164410), match='window.__PRODUCT_DETAIL_APP_INITIAL_STATE__={"pro>
In [2]: re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r2)
The first one returns None
That means there is no matching for the regex
you provided there. That's why you got error for second one.
What is group(1)
in your regex?
For the second sample you have to do like this,
matches = re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__\s=\s({.*}})", r2)
if matches:
json_data2 = json.loads(matches.group(1))
print('json_data2:',json_data2['product']['attributes'][0]['value']['name'])
Which output:
json_data2: Intel Core i5
And if you observe the second webpage source, it has spaces before the =
so you need to adjust the regex accordingly.
New regex for the second sample, 'regex_demo'
r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__\s=\s({.*}})"