Python 3 TypeError: the JSON object must be str, bytes or bytearray, not Tag

Time:08-10

I am trying to run this Python code but I get the error below. It fails even in Google Colab. However, if I shrink the range to 20 pages, such as range(1,21) or range(21,41), the code runs in Google Colab. Why is this happening?

TypeError: the JSON object must be str, bytes or bytearray, not Tag

My python code is:

import requests
import json
from bs4 import BeautifulSoup as bs
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

headers = {
    'User-Agent' : 'Mozilla/5.0'
}

main_url = 'https://www.daraz.com.bd/'
search_url = 'mobile-cases-covers'
category_links = []
results = []
for x in range(1,61):
    url = main_url + search_url + '/?page=' + str(x)
    res = requests.get(url, headers = headers)
    soup = bs(res.content, 'lxml')
    # print(res)
    for script in soup.select('script'):
        if 'window.pageData=' in script.text:
            script = script.text.replace('window.pageData=','')
            break
    items = json.loads(script)['mods']['listItems']
    print(x)

    for item in items:
        #print(item)
        #extract other info you want
        row = [item['name'], item['inStock'], item['priceShow'], item['price'], item['productUrl'], item['ratingScore'], item['review'], item['cheapest_sku'], item['description'], item['brandId'], item['brandName'], item['sellerName']]
        results.append(row)       

df = pd.DataFrame(results, columns = ['Name', 'instock', 'Price show', 'price', 'ProductUrl', 'Rating', 'review', 'cheapest_sku', 'description', 'brandId','brandName','sellerName'])


df.to_csv(r"/Users/fz/Documents/test/mobile-cases-covers_1.csv", encoding='utf-8', index=False)

Full error message:

Traceback (most recent call last):
  File "/Users/fz/Documents/test/scraping_scrpit.py", line 25, in <module>
    items = json.loads(script)['mods']['listItems']
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not Tag

Answer:

This is most likely caused by a rate limiter that the website owner implemented. I was able to run your code a few times, and it usually got to about page 51 without issue. While the requests succeed, the script tags in the response look like this:

<script async="" src="//laz-g-cdn.alicdn.com/mtb/??3rd/0.0.10/require.js,lib-promise/3.0.1/polyfillB.js,lib-mtop/2.4.5/mtop.js"></script>
<script src="//www.google-analytics.com/cx/api.js"></script>
<script>
        var host = location.host;
        if (typeof(cxApi) != 'undefined') {
            cxApi.setDomainName(host.substr(host.indexOf(".")));
            var list = null
            if(list && list.length) {
                for (var i = 0; i < list.length; i++) {
                    if (list[i]) {
                        cxApi.setChosenVariation(list[i].variation_id, list[i].google_id);
                    }
                }
            }
        }
    </script>
<script>window.pageData=<JSON_OBJECT_HERE>

After a certain number of attempts, the response ends up looking like this:

<script crossorigin="" src="https://laz-g-cdn.alicdn.com/mtb/lib-flexible/0.3.2/flexible.js"></script>
<script crossorigin="" src="https://laz-g-cdn.alicdn.com/code/lib/qrcodejs/1.0.0/qrcode.min.js"></script>
<script>
with(document)with(body)with(insertBefore(createElement("script"),firstChild))setAttribute("exparams","category=&userid=&aplus&yunid=&&trid=212230cc16601209392927822e0a06&asid=AQAAAABrb/NiF9lhIQAAAADjGica+ezB7w==",id="tb-beacon-aplus",src=(location>"https"?"//g":"//g")+".alicdn.com/alilog/mlog/aplus_v2.js")
</script>
<script>
        window._config_ = {
            "renderTo": "#nocaptcha",
            "NCTOKENSTR": "5078531ed0bbc731527d465153f47451",
            "logo":"https://img.alicdn.com/tfs/TB1jjchwW61gK0jSZFlXXXDKFXa-166-52.png",      
            "logoLink":"https://www.daraz.com/",      
            "customImage":"//laz-img-cdn.alicdn.com/tfs/TB1ZOeWA7voK1RjSZFDXXXY3pXa-400-400.png",
            "action": "captcha",
            "HOST": "www.daraz.com.bd:443",
           "isCaptchaLanguageI18n":true,
            "PATH": "/mobile-cases-covers",
      "copyright":"© 2020 Daraz Group",
            "FORMACTIOIN": "/mobile-cases-covers/_____tmd_____/verify/",
            "BXSTEP": "100",
            "SECDATA": "5e0c8e1365474455070961b803bd560607b52cabf5960afff39b64ce58073f78012dae8a376add8f7d5090538e1e563bcb80a1d11280bfd4955c983649d236c969fd1eb4598cdb51ba5b713ec720bc040536de03738a4f3eab8a99064564e2ad81de66ecf56ceefe58b43bc7d00cbc44524b2377b38c5a19331b3954a03b72e67340a3180c1f31c1f65e505afaa87609f99953e2b99d97859d87ee4897cc2c85bc83e8d4c8c1c8d150624c026cd826af5fd102e32437f7b8741ec5aa54e4ecf04a4d48cee8a93f05658fc0064bc01527f9b86c27692bf843654469bc1ede3d7d503140047fcd21d0cc4e9f46319f7f0c5d79d8982100c636f7a89e277762470f55c9edda4450e5b8708806aa671a4bb3eed4fd46840c7bb8fa02deada30faed84016f2c168ca12bfb7a79f01d90677238ff42f7bdb9c1deea3fe8e78de932dbf26f5bbb0b8c126d92ca05db8544a9bd120bc2079d80bb9f79e6f69c7c441da1c4d110e9af45385863f1616ae3aa4fd5c31f7156be546737cb6307b5a4697dc9c32c3104915d92329621648d5cc7f930be1935123401109bd1196ec7192c765b487ae44e200f5a44ed657a40ed30ec6e5e27617a139be51bc6c0f4ac587c8ba282ec8cc062d07c41e0cbcaff07b3fef604ff15268036a603929b5be59fc64b62327ac38a5476a0767813c4462e881cf32",
            "NCAPPKEY": "X82Y__720413049743749f56fc6c438bc76003",
            "isUpgrade": "false"
        }
    </script>
<script crossorigin="" src="https://laz-g-cdn.alicdn.com/bsop-static/sufei-punish/0.1.6/build/punishpage.min.js"></script>
<script crossorigin="" src="https://laz-g-cdn.alicdn.com/dt/tracker/4.0.0/??tracker.Tracker.js,tracker.interfaceTrackerPlugin.js,tracker.performanceTrackerPlugin.js" type="text/javascript"></script>
<script type="text/javascript">
    var tracker = new window.Tracker({uidResolver: function() {
    // 具体获取 userId 逻辑自行实现
  return "5078531ed0bbc731527d465153f47451";
  },  pid: 'punish-page', plugins: [[window.interfaceTrackerPlugin], [window.performanceTrackerPlugin, { sampleRate: 1 }]] });
    tracker.install();
    tracker.log({
      code: 11,  // 系统自动生成,请勿修改 0.5%
      c1: '5078531ed0bbc731527d465153f47451',
      c2: '0.1.6',
      msg: 'Daraz-web 页面-验证码',  // 异常信息,推荐传入
      sampleRate: 1.00,  // 目前采样率为 100.00%
    });
  </script>
<script type="text/javascript">
    var tracker = new window.Tracker({uidResolver: function() {
    // 具体获取 userId 逻辑自行实现
  return "5078531ed0bbc731527d465153f47451";
  },  pid: 'punish-page', plugins: [[window.interfaceTrackerPlugin], [window.performanceTrackerPlugin, { sampleRate: 1 }]] });
    tracker.install();
    tracker.log({
      code: 11,  // 系统自动生成,请勿修改 0.5%
      c1: '5078531ed0bbc731527d465153f47451',
      c2: '0.1.6',
      msg: 'Daraz-web 页面-验证码',  // 异常信息,推荐传入
      sampleRate: 1.00,  // 目前采样率为 100.00%
    });
  </script>

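Therefore, once you hit the rate limit, the response is a captcha page and no script contains window.pageData=. The inner loop then finishes without reaching the break, so script is still the last Tag object from soup.select('script') rather than a string, and passing that Tag to json.loads raises the exception.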
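
To make the failure mode concrete, here is a minimal sketch of a lookup that returns None when no script holds the page data, instead of handing a Tag to json.loads. The name find_page_data and the two sample inputs are hypothetical. It assumes, as the original code does, that everything after window.pageData= in the script is plain JSON.

```python
import json

PREFIX = 'window.pageData='


def find_page_data(script_texts):
    """Return the parsed pageData dict from the first matching script, else None.

    script_texts is the text of each <script> tag, for example
    [s.text for s in soup.select('script')] in the question's code.
    """
    for text in script_texts:
        if PREFIX in text:
            # Assumption: the rest of the script is pure JSON.
            return json.loads(text.replace(PREFIX, '', 1))
    return None  # no pageData found, probably the captcha page


# Hypothetical stand-ins for a normal page and a captcha page.
normal_page = [
    'var host = location.host;',
    'window.pageData={"mods": {"listItems": [{"name": "case"}]}}',
]
captcha_page = ['window._config_ = {"action": "captcha"}']

print(find_page_data(normal_page)['mods']['listItems'])
print(find_page_data(captcha_page))
```

In the scraping loop, a None result could be used to stop or pause the crawl rather than letting the TypeError surface.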

I think this could be solved by adding a delay between requests when crawling the site, but it usually takes some trial and error to get right, as it depends heavily on how the target website handles rate limiting.
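
One way to add such a delay is a retry loop that waits longer after each blocked response. This is only a sketch under assumptions: fetch_with_backoff, its parameters, and the delay values are hypothetical, and the site may keep serving the captcha no matter how long you wait. The fetch and sleep functions are passed in so the demo below runs without touching the network or actually sleeping.

```python
import time


def fetch_with_backoff(fetch, is_blocked, max_tries=4, base_delay=2.0,
                       sleep=time.sleep):
    """Call fetch() until is_blocked(result) is False.

    Waits base_delay seconds after the first blocked attempt and doubles
    the wait each time. Returns the first unblocked result, or None if
    all max_tries attempts were blocked.
    """
    delay = base_delay
    for attempt in range(max_tries):
        result = fetch()
        if not is_blocked(result):
            return result
        if attempt < max_tries - 1:
            sleep(delay)  # back off before the next attempt
            delay *= 2
    return None


# Demo with fake responses: two blocked pages, then a good one.
responses = iter(['captcha', 'captcha', 'window.pageData={}'])
waits = []  # records the requested delays instead of sleeping
page = fetch_with_backoff(
    fetch=lambda: next(responses),
    is_blocked=lambda body: 'window.pageData=' not in body,
    sleep=waits.append,
)
print(page, waits)
```

In the question's loop, fetch could be a lambda wrapping requests.get(url, headers=headers).text, with is_blocked checking whether 'window.pageData=' is missing from the body. A short fixed pause between pages may also help avoid hitting the limit in the first place.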
