The scraped content is different from what I see in browser Inspector - Python scraper with Selenium

Time:04-30

I want to scrape this website https://lens.zhihu.com/api/v4/videos/1123764263738900480 to get the play_url using Python.

The page appears to redirect very quickly while the URL stays unchanged. The play_url in the original page is invalid: if you visit it, you will see "You do not have permission...". So I use time.sleep(10) in the program to wait out the redirect (this does not seem to work with Requests).

(I made a mistake. The redirect I saw may simply have been caused by my Firefox browser rendering the JSON automatically. But the method I mentioned really can handle a redirect in which the URL is unchanged.)

But as I can see in 1.txt, which the program writes, the scraped content doesn't contain the play_url I want, and the URL in it is still invalid.

Here is the play_url I want. It can be seen in the browser Inspector, in an a tag whose class is url: [image]

Here is the code I use:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

service = Service(executable_path='C:\\Users\\X\\chromedriver_win32\\chromedriver.exe')
options = Options()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument("--headless")
options.add_argument('user-agent="Mozilla/5.0"')
driver = webdriver.Chrome(service=service, options=options)

driver.get('https://lens.zhihu.com/api/v4/videos/1123764263738900480')
time.sleep(10)
pageSource = driver.page_source
driver.quit()
bs = BeautifulSoup(pageSource, 'html.parser')

with open('C:\\Users\\X\\Desktop\\1.txt', 'a', encoding='utf-8') as file:
    file.write(f"{bs}")
play_url = bs.find('a', {'class': 'url'}).get("title")
print(play_url)

and it returns:

Traceback (most recent call last):
  File "c:\Users\X\Desktop\Handle redirect\stackoverflow.py", line 22, in <module>      
    play_url = bs.find('a', {'class': 'url'}).get("title")
AttributeError: 'NoneType' object has no attribute 'get'

So the scraped content contains no ('a', {'class': 'url'}) element, even though I see one in the browser Inspector.

Why is the scraped content different from what I see in the browser Inspector, and how do I handle it?

Edit: Thanks to the comment by Martin Evans, I now know that the browser runs the JavaScript in the source code, so the rendered page can look different from the source. But in my case, I don't see any JS files in the Network tab of Developer Tools. In fact, there are only two requests: [image]. So I still have no answer to the question above.

Update: Thanks to the comment by @Sarhan, I solved the problem. I was using Firefox, which automatically renders the response in its own viewer, so the Inspector shows tags that do not exist in what the server sent. I tried the URL in Edge and here is the result: [image]. It is just the JSON returned by the server, with no ('a', {'class': 'url'}) at all. Also, many thanks to @Dimitar, whose answer let me get the play_url.
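
To illustrate the update above: when a browser driven by Selenium loads a JSON endpoint, page_source is typically the raw JSON wrapped in a little HTML, not a page with a tags. The sketch below recovers the JSON from such a string. The helper name json_from_page_source, the wrapper markup, and the example.com URL are assumptions for illustration, not the actual server response.

```python
import html
import json


def json_from_page_source(page_source):
    """Extract the JSON object embedded in a browser-rendered page source.

    Assumes the source holds exactly one top-level JSON object and that the
    surrounding HTML wrapper contains no curly braces of its own.
    """
    start = page_source.find("{")
    end = page_source.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in page source")
    # The browser serializes "&" as "&amp;" inside text nodes, so unescape first.
    return json.loads(html.unescape(page_source[start:end + 1]))


# Hypothetical page source, shaped like a browser's wrapper around a JSON body.
sample_source = (
    "<html><head></head><body><pre>"
    '{"playlist": {"LD": {"play_url": "https://example.com/v.mp4?a=1&amp;b=2"}}}'
    "</pre></body></html>"
)

data = json_from_page_source(sample_source)
print(data["playlist"]["LD"]["play_url"])
```

In the Selenium program above, this would be called as json_from_page_source(pageSource) in place of the bs.find(...) lookup.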

Answer:

You can use Requests to get the response from that url and extract play_url.

import requests

url = "https://lens.zhihu.com/api/v4/videos/1123764263738900480"

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36"
}

response = requests.get(url, headers=headers)

data = response.json()
play_url = data['playlist']['LD']['play_url']

print(play_url)
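
The direct lookup data['playlist']['LD']['play_url'] raises KeyError if any level is missing. A more defensive variant might look like the sketch below. The function name extract_play_url, the sample payload, and the "HD" key are hypothetical; only the playlist / LD / play_url nesting comes from the code above.

```python
import json


def extract_play_url(data, quality="LD"):
    """Return data['playlist'][quality]['play_url'], or None if any level is missing."""
    playlist = data.get("playlist")
    if not isinstance(playlist, dict):
        return None
    entry = playlist.get(quality)
    if not isinstance(entry, dict):
        return None
    return entry.get("play_url")


# Hypothetical payload with the same nesting as the API response used above.
sample = json.loads('{"playlist": {"LD": {"play_url": "https://example.com/ld.mp4"}}}')

print(extract_play_url(sample))        # the LD entry is present
print(extract_play_url(sample, "HD"))  # "HD" is absent here, so this is None
```

With the request above, it could be used as extract_play_url(response.json()), ideally after response.raise_for_status() so that HTTP errors surface before parsing.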