I am trying to run this Python code but I get the error below. It fails even in Google Colab. However, if I keep the range to 20 pages at a time, like range(1,21) or range(21,41), the code runs fine in Google Colab. Why is this happening?
TypeError: the JSON object must be str, bytes or bytearray, not Tag
My python code is:
import requests
import json
from bs4 import BeautifulSoup as bs
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
headers = {
    'User-Agent' : 'Mozilla/5.0'
}
main_url = 'https://www.daraz.com.bd/'
search_url = 'mobile-cases-covers'
category_links = []
results = []
for x in range(1,61):
    url = main_url + search_url + '/?page=' + str(x)
    res = requests.get(url, headers = headers)
    soup = bs(res.content, 'lxml')
    # print(res)
    for script in soup.select('script'):
        if 'window.pageData=' in script.text:
            script = script.text.replace('window.pageData=','')
            break
    items = json.loads(script)['mods']['listItems']
    print(x)
    for item in items:
        #print(item)
        #extract other info you want
        row = [item['name'], item['inStock'], item['priceShow'], item['price'], item['productUrl'], item['ratingScore'], item['review'], item['cheapest_sku'], item['description'], item['brandId'], item['brandName'], item['sellerName']]
        results.append(row)
df = pd.DataFrame(results, columns = ['Name', 'instock', 'Price show', 'price', 'ProductUrl', 'Rating', 'review', 'cheapest_sku', 'description', 'brandId','brandName','sellerName'])
df.to_csv(r"/Users/fz/Documents/test/mobile-cases-covers_1.csv", encoding='utf-8', index=False)
Full error message:
Traceback (most recent call last):
  File "/Users/fz/Documents/test/scraping_scrpit.py", line 25, in <module>
    items = json.loads(script)['mods']['listItems']
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not Tag
CodePudding user response:
Most likely this is caused by a rate limiter that the website owner has implemented. I was able to run your code a few times, usually up to about page 51 with no issue. While the requests succeed, the script tags in the response look like this:
<script async="" src="//laz-g-cdn.alicdn.com/mtb/??3rd/0.0.10/require.js,lib-promise/3.0.1/polyfillB.js,lib-mtop/2.4.5/mtop.js"></script>
<script src="//www.google-analytics.com/cx/api.js"></script>
<script>
var host = location.host;
if (typeof(cxApi) != 'undefined') {
cxApi.setDomainName(host.substr(host.indexOf(".")));
var list = null
if(list && list.length) {
for (var i = 0; i < list.length; i++) {
if (list[i]) {
cxApi.setChosenVariation(list[i].variation_id, list[i].google_id);
}
}
}
}
</script>
<script>window.pageData=<JSON_OBJECT_HERE>
After a certain number of attempts, the response ends up looking like this:
<script crossorigin="" src="https://laz-g-cdn.alicdn.com/mtb/lib-flexible/0.3.2/flexible.js"></script>
<script crossorigin="" src="https://laz-g-cdn.alicdn.com/code/lib/qrcodejs/1.0.0/qrcode.min.js"></script>
<script>
with(document)with(body)with(insertBefore(createElement("script"),firstChild))setAttribute("exparams","category=&userid=&aplus&yunid=&&trid=212230cc16601209392927822e0a06&asid=AQAAAABrb/NiF9lhIQAAAADjGica ezB7w==",id="tb-beacon-aplus",src=(location>"https"?"//g":"//g")+".alicdn.com/alilog/mlog/aplus_v2.js")
</script>
<script>
window._config_ = {
"renderTo": "#nocaptcha",
"NCTOKENSTR": "5078531ed0bbc731527d465153f47451",
"logo":"https://img.alicdn.com/tfs/TB1jjchwW61gK0jSZFlXXXDKFXa-166-52.png",
"logoLink":"https://www.daraz.com/",
"customImage":"//laz-img-cdn.alicdn.com/tfs/TB1ZOeWA7voK1RjSZFDXXXY3pXa-400-400.png",
"action": "captcha",
"HOST": "www.daraz.com.bd:443",
"isCaptchaLanguageI18n":true,
"PATH": "/mobile-cases-covers",
"copyright":"© 2020 Daraz Group",
"FORMACTIOIN": "/mobile-cases-covers/_____tmd_____/verify/",
"BXSTEP": "100",
"SECDATA": "5e0c8e1365474455070961b803bd560607b52cabf5960afff39b64ce58073f78012dae8a376add8f7d5090538e1e563bcb80a1d11280bfd4955c983649d236c969fd1eb4598cdb51ba5b713ec720bc040536de03738a4f3eab8a99064564e2ad81de66ecf56ceefe58b43bc7d00cbc44524b2377b38c5a19331b3954a03b72e67340a3180c1f31c1f65e505afaa87609f99953e2b99d97859d87ee4897cc2c85bc83e8d4c8c1c8d150624c026cd826af5fd102e32437f7b8741ec5aa54e4ecf04a4d48cee8a93f05658fc0064bc01527f9b86c27692bf843654469bc1ede3d7d503140047fcd21d0cc4e9f46319f7f0c5d79d8982100c636f7a89e277762470f55c9edda4450e5b8708806aa671a4bb3eed4fd46840c7bb8fa02deada30faed84016f2c168ca12bfb7a79f01d90677238ff42f7bdb9c1deea3fe8e78de932dbf26f5bbb0b8c126d92ca05db8544a9bd120bc2079d80bb9f79e6f69c7c441da1c4d110e9af45385863f1616ae3aa4fd5c31f7156be546737cb6307b5a4697dc9c32c3104915d92329621648d5cc7f930be1935123401109bd1196ec7192c765b487ae44e200f5a44ed657a40ed30ec6e5e27617a139be51bc6c0f4ac587c8ba282ec8cc062d07c41e0cbcaff07b3fef604ff15268036a603929b5be59fc64b62327ac38a5476a0767813c4462e881cf32",
"NCAPPKEY": "X82Y__720413049743749f56fc6c438bc76003",
"isUpgrade": "false"
}
</script>
<script crossorigin="" src="https://laz-g-cdn.alicdn.com/bsop-static/sufei-punish/0.1.6/build/punishpage.min.js"></script>
<script crossorigin="" src="https://laz-g-cdn.alicdn.com/dt/tracker/4.0.0/??tracker.Tracker.js,tracker.interfaceTrackerPlugin.js,tracker.performanceTrackerPlugin.js" type="text/javascript"></script>
<script type="text/javascript">
var tracker = new window.Tracker({uidResolver: function() {
// implement the actual userId lookup logic yourself
return "5078531ed0bbc731527d465153f47451";
}, pid: 'punish-page', plugins: [[window.interfaceTrackerPlugin], [window.performanceTrackerPlugin, { sampleRate: 1 }]] });
tracker.install();
tracker.log({
code: 11, // system-generated, do not modify 0.5%
c1: '5078531ed0bbc731527d465153f47451',
c2: '0.1.6',
msg: 'Daraz-web 页面-验证码', // exception message, recommended to pass in
sampleRate: 1.00, // current sampling rate is 100.00%
});
</script>
<script type="text/javascript">
var tracker = new window.Tracker({uidResolver: function() {
// implement the actual userId lookup logic yourself
return "5078531ed0bbc731527d465153f47451";
}, pid: 'punish-page', plugins: [[window.interfaceTrackerPlugin], [window.performanceTrackerPlugin, { sampleRate: 1 }]] });
tracker.install();
tracker.log({
code: 11, // system-generated, do not modify 0.5%
c1: '5078531ed0bbc731527d465153f47451',
c2: '0.1.6',
msg: 'Daraz-web 页面-验证码', // exception message, recommended to pass in
sampleRate: 1.00, // current sampling rate is 100.00%
});
</script>
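Therefore, once you hit the rate limit, the site serves a CAPTCHA page instead of the product listing (note "action": "captcha" in the config above, and the tracker message, which translates roughly to "Daraz-web page: CAPTCHA"), so no script contains window.pageData=. The inner loop then finishes without ever reaching break, script is still the last Tag object returned by soup.select('script'), and json.loads(script) raises the TypeError because it receives a Tag rather than a string.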
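To make that failure visible instead of letting a Tag reach json.loads, the lookup can return None when the marker is missing. This is only a minimal sketch: extract_list_items is a hypothetical helper name, and it assumes the payload after window.pageData= is plain JSON, as it is in your working requests.

```python
import json

MARKER = 'window.pageData='


def extract_list_items(script_texts):
    """Return the 'listItems' list from the first script text that
    contains MARKER, or None if no script has it (for example when
    the site answered with a CAPTCHA page)."""
    for text in script_texts:
        if MARKER in text:
            # Same approach as the original code: drop the marker and
            # parse what is left as JSON.
            payload = text.replace(MARKER, '').strip()
            return json.loads(payload)['mods']['listItems']
    # No script carried the page data, so report that explicitly.
    return None
```

In your loop this could be called as items = extract_list_items(s.text for s in soup.select('script')), followed by a check such as if items is None: break, so the pages scraped so far still get written to the CSV.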
I think this can potentially be solved by adding a delay between requests when crawling the site (for example, a time.sleep call between pages), but it usually takes some trial and error to get right, as it depends heavily on how the target website handles rate limiting.
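One way to experiment with this is to retry a blocked page with increasing pauses. The sketch below is an assumption, not something I tested against Daraz: fetch_with_backoff, fetch_page, max_retries and base_delay are hypothetical names, fetch_page is expected to return the parsed items or None when the page was blocked, and the delay values are guesses. Once a CAPTCHA has been served, waiting may not be enough to clear it.

```python
import time


def fetch_with_backoff(fetch_page, page, max_retries=3, base_delay=5.0,
                       sleep=time.sleep):
    """Call fetch_page(page) until it returns something other than None.

    fetch_page is a caller-supplied function (hypothetical here) that
    returns the parsed items, or None when the response was blocked.
    The pause doubles after every failed attempt.
    """
    for attempt in range(max_retries + 1):
        items = fetch_page(page)
        if items is not None:
            return items
        if attempt < max_retries:
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    # Still blocked after all retries; let the caller decide what to do.
    return None
```

It is probably also worth sleeping for a second or two after every successful page, so the limit is less likely to trigger in the first place.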