The scraped content is different from what I see in the browser Inspector - Python scraper

Time:04-28

I want to scrape this website https://lens.zhihu.com/api/v4/videos/1123764263738900480 to get the play_url using Python.

The page redirects very quickly and the URL stays the same. The play_url in the original page is invalid: if you visit it, you see "You do not have permission...". So I use a waitForLoad function in the program to wait for the redirect, but as I can see in 1.txt, the scraped content doesn't contain the play_url I want, and the URL in it is still the invalid one.

Here is the play_url I want. It is in an a tag whose class is url (screenshot not included).

Here is the code I use:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

def waitForLoad(driver):
    count = 0
    while True:
        count += 1
        if count > 20:
            return
        time.sleep(.5)

service = Service(executable_path='C:\\Users\\X\\chromedriver_win32\\chromedriver.exe')
options = Options()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument("--headless")
options.add_argument('user-agent="Mozilla/5.0"')
driver = webdriver.Chrome(service=service, options=options)

driver.get('https://lens.zhihu.com/api/v4/videos/1123764263738900480')
waitForLoad(driver)
pageSource = driver.page_source
driver.quit()
bs = BeautifulSoup(pageSource, 'html.parser')

with open('C:\\Users\\X\\Desktop\\1.txt', 'a', encoding='utf-8') as file:
    file.write(f"{bs}")
play_url = bs.find('a', {'class': 'url'}).get("title")
print(play_url)

and it returns:

Traceback (most recent call last):
  File "c:\Users\X\Desktop\Handle redirect\stackoverflow.py", line 31, in <module>      
    play_url = bs.find('a', {'class': 'url'}).get("title")
AttributeError: 'NoneType' object has no attribute 'get'

What's the problem with it?

Answer:

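That URL is an API endpoint that returns JSON rather than an HTML page, so the raw response is unlikely to contain the a tag that the Inspector shows. You don't need Selenium here: you can use Requests to get the response from that URL and read play_url from the parsed JSON.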

import requests

url = "https://lens.zhihu.com/api/v4/videos/1123764263738900480"

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36"
}

response = requests.get(url, headers=headers)

data = response.json()
play_url = data['playlist']['LD']['play_url']

print(play_url)
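
Indexing data['playlist']['LD']['play_url'] directly raises a KeyError if the response has a different shape, for example when the request is rejected or that quality level is missing. A minimal sketch of a more defensive lookup follows. It assumes the payload is a dict shaped like the one indexed above; the helper name extract_play_url, the "HD" and "SD" keys, and the sample payload are hypothetical and only there for illustration ("LD" is the only key confirmed above).

```python
from typing import Any, Mapping, Optional, Sequence


def extract_play_url(
    data: Mapping[str, Any],
    # Only "LD" is known from the answer above; "HD" and "SD" are assumed names.
    qualities: Sequence[str] = ("HD", "SD", "LD"),
) -> Optional[str]:
    """Return the first non-empty play_url under data['playlist'], or None."""
    playlist = data.get("playlist") or {}
    for quality in qualities:
        entry = playlist.get(quality) or {}
        url = entry.get("play_url")
        if url:
            return url
    return None


# Hypothetical payload with the same nesting the answer indexes into.
sample = {"playlist": {"LD": {"play_url": "https://example.com/video-ld.mp4"}}}

print(extract_play_url(sample))  # the LD URL from the sample payload
print(extract_play_url({}))      # None, instead of raising KeyError
```

With the real response you would call extract_play_url(response.json()) and check for None before using the result.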