I can't parse the transcript of a video from https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript
The response from requests doesn't contain the span element that actually holds the transcript text. What could be the problem?
import requests
url = 'https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript'
page = requests.get(url)
print(page.content)
Is there any way to reach the transcript? When I try to access it, I get an "attribute not found" error. Thank you.
CodePudding user response:
That's because the data is not loaded via the link you're using, but via a call to their GraphQL instance.
Using curl, you can fetch the data like so:
curl 'https://www.ted.com/graphql?operationName=Transcript&variables={"id":"alexis_nikole_nelson_a_flavorful_field_guide_to_foraging","language":"en"}&extensions={"persistedQuery":{"version":1,"sha256Hash":"18f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6"}}' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Referer: https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript' -H 'content-type: application/json' -H 'client-id: Zenith production' -H 'x-operation-name: Transcript' --output - | gzip -d
Note that the URL is URL-encoded. In Python, you can URL-encode a string with the quote() function from urllib.parse (from urllib.parse import quote).
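The URL-encoding step can be sketched as follows. This is a minimal example, not the site's official client: the function name build_transcript_url and the constant names are made up for illustration, and the hash is simply the one copied from the curl command above, which may stop working if the site changes its persisted query.

```python
import json
from urllib.parse import quote

# Endpoint and persisted-query hash copied from the curl command above.
GRAPHQL_ENDPOINT = "https://www.ted.com/graphql"
TRANSCRIPT_HASH = "18f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6"


def build_transcript_url(talk_id, language="en"):
    """Return the GraphQL URL with its JSON parameters URL-encoded.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Compact separators reproduce the JSON exactly as it appears in the curl command.
    variables = json.dumps(
        {"id": talk_id, "language": language}, separators=(",", ":")
    )
    extensions = json.dumps(
        {"persistedQuery": {"version": 1, "sha256Hash": TRANSCRIPT_HASH}},
        separators=(",", ":"),
    )
    # safe="" makes quote() encode every reserved character, including "/" and ":".
    return (
        f"{GRAPHQL_ENDPOINT}?operationName=Transcript"
        f"&variables={quote(variables, safe='')}"
        f"&extensions={quote(extensions, safe='')}"
    )


if __name__ == "__main__":
    print(build_transcript_url("alexis_nikole_nelson_a_flavorful_field_guide_to_foraging"))
```

Building the JSON with json.dumps rather than by hand avoids quoting mistakes when the talk id or language changes.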
So simply translate the above curl command to Python. There's no magic; just set the correct headers. If you're lazy, you can also use an online converter to turn a curl command into Python code.
This produces:
import requests
from requests.structures import CaseInsensitiveDict
# Single quotes around the URL, because the query string itself contains double quotes.
url = 'https://www.ted.com/graphql?operationName=Transcript&variables={"id":"alexis_nikole_nelson_a_flavorful_field_guide_to_foraging","language":"en"}&extensions={"persistedQuery":{"version":1,"sha256Hash":"18f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6"}}'
headers = CaseInsensitiveDict()
headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
headers["Accept"] = "*/*"
headers["Accept-Language"] = "en-US,en;q=0.5"
headers["Accept-Encoding"] = "gzip, deflate, br"
headers["Referer"] = "https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript"
headers["content-type"] = "application/json"
headers["client-id"] = "Zenith production"
headers["x-operation-name"] = "Transcript"
resp = requests.get(url, headers=headers)
print(resp.content)
Output:
b'{"data":{"translation":{"id":"209255","language" ...
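Since the response body is JSON, it can be decoded instead of printed as raw bytes. The sketch below only relies on the "data" and "translation" keys visible in the output above; the helper name translation_from_response is hypothetical, the sample payload is a shortened stand-in, and the fields inside "translation" beyond "id" are not shown here, so inspect them yourself before relying on any of them.

```python
import json


def translation_from_response(body):
    """Decode a GraphQL response body and return its "translation" object.

    Hypothetical helper: only the "data" -> "translation" path seen in the
    output above is assumed; everything below it is left to the caller.
    """
    payload = json.loads(body)
    # GraphQL servers conventionally report failures under a top-level "errors" key.
    if payload.get("errors"):
        raise RuntimeError(f"GraphQL errors: {payload['errors']}")
    data = payload.get("data") or {}
    translation = data.get("translation")
    if translation is None:
        raise KeyError("response has no data.translation object")
    return translation


if __name__ == "__main__":
    # Shortened stand-in for resp.content, limited to the fields shown above.
    sample = b'{"data":{"translation":{"id":"209255"}}}'
    translation = translation_from_response(sample)
    # List the available fields to see where the transcript text lives.
    print(sorted(translation.keys()))
```

With the real request, pass resp.content (or use resp.json() directly) and explore the keys of the returned dictionary.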