I want to scrape https://lens.zhihu.com/api/v4/videos/1123764263738900480 with Python to get the play_url.
When I open this URL in a browser, the page seems to redirect very quickly while the URL stays unchanged. The play_url in the original page is invalid: if you visit it, you see "You do not have permission...". So I use time.sleep(10) in the program to wait for the redirect (this does not seem to work with Requests).
(I made a mistake here. The redirect I saw may simply have been caused by my Firefox browser rendering the JSON automatically. The method I mentioned can still handle a redirect where the URL is unchanged.)
But as I can see in 1.txt, which the program writes, the scraped content doesn't contain the play_url I want, and the URL in it is still invalid.
Here is the play_url I want. It can be seen in the browser Inspector in an a tag whose class is url: image
Here is the code I use:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
service = Service(executable_path='C:\\Users\\X\\chromedriver_win32\\chromedriver.exe')
options = Options()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument("--headless")
options.add_argument('user-agent="Mozilla/5.0"')
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://lens.zhihu.com/api/v4/videos/1123764263738900480')
time.sleep(10)
pageSource = driver.page_source
driver.quit()
bs = BeautifulSoup(pageSource, 'html.parser')
with open('C:\\Users\\X\\Desktop\\1.txt', 'a', encoding='utf-8') as file:
    file.write(f"{bs}")
play_url = bs.find('a', {'class': 'url'}).get("title")
print(play_url)
Running it raises:
Traceback (most recent call last):
  File "c:\Users\X\Desktop\Handle redirect\stackoverflow.py", line 22, in <module>
    play_url = bs.find('a', {'class': 'url'}).get("title")
AttributeError: 'NoneType' object has no attribute 'get'
So the scraped content contains no ('a', {'class': 'url'}) element, even though I see one in the browser Inspector.
Why is the scraped content different from what I see in the browser Inspector, and how can I handle it?
Edit: Thanks to the comment by Martin Evans, I now know that the browser runs JavaScript from the source code, so the rendered page looks different from the source. But in my case, I don't see any JS files in the Network tab of Developer Tools. Actually, there are only two requests: image. So I still have no answer to the question above.
Update: Thanks to the comment by @Sarhan, I solved the problem. I was using Firefox, and it renders the response automatically, showing a tag that does not exist in the source. I tried the URL in Edge and here is the result: image. It is just the JSON I get from the server, and there is no ('a', {'class': 'url'}) at all. Also, thanks a lot to @Dimitar, I can now get the play_url.
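For anyone who still wants to go through Selenium: since the endpoint returns JSON rather than HTML, the page source can be parsed as JSON instead of searched for tags. Below is a minimal sketch, assuming the browser wraps the JSON in a pre element (which is what Chrome appears to do, but I have not verified it for every version). The helper name extract_json_from_page_source and the sample page source are made up for illustration; only the playlist / LD / play_url keys come from the real response.

```python
import html
import json
import re


def extract_json_from_page_source(page_source: str) -> dict:
    """Parse JSON that a browser may have wrapped in <pre>...</pre>.

    Hypothetical helper: falls back to treating the whole string as JSON
    when no <pre> element is found.
    """
    match = re.search(r"<pre[^>]*>(.*?)</pre>", page_source, flags=re.DOTALL)
    raw = match.group(1) if match else page_source
    # The browser HTML-escapes characters such as & inside the <pre> element.
    return json.loads(html.unescape(raw))


# Made-up page source, shaped like what driver.page_source might contain.
sample_source = (
    "<html><head></head><body><pre>"
    '{"playlist": {"LD": {"play_url": "https://example.com/v.mp4?a=1&amp;b=2"}}}'
    "</pre></body></html>"
)

data = extract_json_from_page_source(sample_source)
play_url = data["playlist"]["LD"]["play_url"]
print(play_url)
```

In the real program, sample_source would be replaced by pageSource from the Selenium code above.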
CodePudding user response:
You can use Requests to fetch the response from that URL and extract play_url from the JSON.
import requests
url = "https://lens.zhihu.com/api/v4/videos/1123764263738900480"
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36"
}
response = requests.get(url, headers=headers)
data = response.json()
play_url = data['playlist']['LD']['play_url']
print(play_url)
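The direct lookup data['playlist']['LD']['play_url'] raises KeyError if the response lacks one of those keys, for example when the server returns an error object. A slightly more defensive sketch follows. The function name pick_play_url is hypothetical, and the "HD" and "SD" quality keys are an assumption on my part; only "LD" is confirmed by the code above.

```python
from typing import Optional


def pick_play_url(data: dict, qualities=("HD", "SD", "LD")) -> Optional[str]:
    """Return the first play_url found among the given quality keys.

    "HD" and "SD" are assumed key names; only "LD" appears in the answer.
    Returns None instead of raising when the expected keys are missing.
    """
    playlist = data.get("playlist") or {}
    for quality in qualities:
        entry = playlist.get(quality) or {}
        url = entry.get("play_url")
        if url:
            return url
    return None


# Made-up payloads standing in for response.json().
sample = {"playlist": {"LD": {"play_url": "https://example.com/ld.mp4"}}}
print(pick_play_url(sample))
print(pick_play_url({}))
```

With the Requests code above, you would call pick_play_url(response.json()) and check for None before using the result.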