I want to scrape https://lens.zhihu.com/api/v4/videos/1123764263738900480 with Python to get the play_url.
The page redirects very quickly and the URL stays the same. The play_url in the original page is invalid: if you visit it, you see "You do not have permission...". So I use a waitForLoad function in the program to wait out the redirect, but as I can see in 1.txt, which the program writes, the scraped content doesn't contain the play_url I want, and the URL in it is still invalid.
The play_url I want is in an a tag whose class is url.
Here is the code I use:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
def waitForLoad(driver):
    # Poll every 0.5 s and give up after 20 iterations (about 10 seconds).
    count = 0
    while True:
        count += 1
        if count > 20:
            return
        time.sleep(.5)
service = Service(executable_path='C:\\Users\\X\\chromedriver_win32\\chromedriver.exe')
options = Options()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument("--headless")
options.add_argument('user-agent="Mozilla/5.0"')
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://lens.zhihu.com/api/v4/videos/1123764263738900480')
waitForLoad(driver)
pageSource = driver.page_source
driver.quit()
bs = BeautifulSoup(pageSource, 'html.parser')
with open('C:\\Users\\X\\Desktop\\1.txt', 'a', encoding='utf-8') as file:
    file.write(f"{bs}")
play_url = bs.find('a', {'class': 'url'}).get("title")
print(play_url)
It fails with:
Traceback (most recent call last):
  File "c:\Users\X\Desktop\Handle redirect\stackoverflow.py", line 31, in <module>
    play_url = bs.find('a', {'class': 'url'}).get("title")
AttributeError: 'NoneType' object has no attribute 'get'
What's the problem with it?
User response:
That URL is an API endpoint that returns JSON, so you don't need Selenium or BeautifulSoup. You can use Requests to fetch the response and read play_url from the parsed JSON.
import requests
url = "https://lens.zhihu.com/api/v4/videos/1123764263738900480"
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36"
}
response = requests.get(url, headers=headers)
data = response.json()
play_url = data['playlist']['LD']['play_url']
print(play_url)
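
If the response might lack the playlist or the LD entry, indexing with data['playlist']['LD']['play_url'] raises a KeyError. Here is a minimal sketch of a more defensive lookup. The helper name extract_play_url, the "HD" key, and the sample payload are made up for illustration; only the playlist / LD / play_url layout comes from the code above.

```python
def extract_play_url(data, quality="LD"):
    """Return the play_url for one quality level, or None if it is missing."""
    # Fall back to empty dicts so a missing or null level does not raise.
    playlist = data.get("playlist") or {}
    entry = playlist.get(quality) or {}
    return entry.get("play_url")


# Hypothetical payload shaped like the response used above; the URL is made up.
sample = {"playlist": {"LD": {"play_url": "https://example.com/video.mp4"}}}

print(extract_play_url(sample))        # https://example.com/video.mp4
print(extract_play_url(sample, "HD"))  # None, this sample has no "HD" entry
print(extract_play_url({}))            # None, no playlist at all
```

With the real request you would call extract_play_url(response.json()) and check for None before using the result. Calling response.raise_for_status() first makes HTTP errors surface before the JSON is parsed.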