How to parse the transcript from TED Talks

Time:05-30

I can't parse the transcript of a video from https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript

The response I get from requests doesn't contain the span elements where the transcript text actually is. What could be the problem?

import requests

url = 'https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript'
page = requests.get(url)
print(page.content)

Is there any way to reach the transcript? I need to get at this text, but all I get is "no attribute found". Thank you.

CodePudding user response:

That's because the transcript is not part of the HTML at the URL you're requesting; the page loads it separately through a call to TED's GraphQL endpoint.

Using curl, you can fetch the data like so:

curl 'https://www.ted.com/graphql?operationName=Transcript&variables=%7B%22id%22%3A%22alexis_nikole_nelson_a_flavorful_field_guide_to_foraging%22%2C%22language%22%3A%22en%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%2218f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6%22%7D%7D' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Referer: https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript' -H 'content-type: application/json' -H 'client-id: Zenith production' -H 'x-operation-name: Transcript' --output - | gzip -d

Note that the query string in the URL has to be URL-encoded. In Python, quote() from urllib.parse (from urllib.parse import quote) URL-encodes a string.
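As a rough sketch of that encoding step, the snippet below rebuilds the GraphQL URL from plain Python dicts instead of typing the encoded string by hand. The helper name build_transcript_url is made up for this illustration; the operation name, talk id and hash are simply copied from the curl command above, and whether TED keeps accepting that hash is an assumption.

```python
import json
from urllib.parse import quote, urlencode

# Values copied from the curl command above.
variables = {
    "id": "alexis_nikole_nelson_a_flavorful_field_guide_to_foraging",
    "language": "en",
}
extensions = {
    "persistedQuery": {
        "version": 1,
        "sha256Hash": "18f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6",
    }
}


def build_transcript_url(variables, extensions):
    """Hypothetical helper: return the GraphQL URL with its JSON parameters percent-encoded."""
    query = urlencode(
        {
            "operationName": "Transcript",
            # Compact separators keep the JSON free of extra spaces.
            "variables": json.dumps(variables, separators=(",", ":")),
            "extensions": json.dumps(extensions, separators=(",", ":")),
        },
        quote_via=quote,  # percent-encode with quote() rather than quote_plus()
    )
    return "https://www.ted.com/graphql?" + query


url = build_transcript_url(variables, extensions)
print(url)
```

To fetch another talk, it should be enough to change the "id" entry in variables, assuming the same persisted query hash still applies.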

So simply translate the curl command above to Python. There's no magic; just set the correct headers. If you're lazy, you can also use an online converter to turn a curl command into Python code.

This produces:

import requests
from requests.structures import CaseInsensitiveDict

url = "https://www.ted.com/graphql?operationName=Transcript&variables=%7B%22id%22%3A%22alexis_nikole_nelson_a_flavorful_field_guide_to_foraging%22%2C%22language%22%3A%22en%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%2218f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6%22%7D%7D"

headers = CaseInsensitiveDict()
headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
headers["Accept"] = "*/*"
headers["Accept-Language"] = "en-US,en;q=0.5"
headers["Accept-Encoding"] = "gzip, deflate, br"
headers["Referer"] = "https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript"
headers["content-type"] = "application/json"
headers["client-id"] = "Zenith production"
headers["x-operation-name"] = "Transcript"

resp = requests.get(url, headers=headers)
print(resp.content)

Output:

b'{"data":{"translation":{"id":"209255","language" ...
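The output above is cut off, so the exact location of the transcript text inside the JSON isn't shown here. One way to find it is to decode the response and list every key path. The sketch below does that on a small stand-in payload; the helper key_paths is made up for this illustration, and the sample bytes are hypothetical (only the first few keys match the real output), so run it on your own resp.content to see the actual structure.

```python
import json


def key_paths(obj, prefix=""):
    """Yield a dotted path for every leaf value in a decoded JSON document."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from key_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from key_paths(value, f"{prefix}[{index}]")
    else:
        yield prefix


# Hypothetical stand-in for resp.content; the real response has more fields.
sample = b'{"data":{"translation":{"id":"209255","language":"en"}}}'

payload = json.loads(sample)  # json.loads accepts bytes as well as str
for path in key_paths(payload):
    print(path)
```

Once you know the path that holds the text, you can index into resp.json() directly instead of printing the raw bytes.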