Get data from a <script> var with BeautifulSoup

Time:09-23

Url = https://letterboxd.com/film/dogville/

I want to get the movie name and release year with BeautifulSoup.

import requests
from bs4 import BeautifulSoup
url = 'https://letterboxd.com/film/dogville/'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
soup.find_all("script")[10] 

Output:

<script>
    var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };



</script>

I managed to get the <script> block, but I don't know how to get name and releaseYear. How can I get them?

CodePudding user response:

The problem is that bs4 is not a JavaScript parser. Once you reach that boundary, you need something else: a JavaScript parser. A weaker solution is to use the json module from the standard library to convert the dictionary-like string into a Python dictionary.

Once you have the string containing the JavaScript code, you can either use a regex to extract the dictionary-like substring, or go the other way around and trim the string down with plain string operations.
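The regex route could look roughly like the sketch below. It assumes the script text has the same shape as the output shown in the question, with a single `var filmData = { ... };` statement and no nested braces. The names `script_text` and `object_literal` are only illustrative.

```python
import re

# Sample script text copied from the question's output (stands in for the real page).
script_text = '''
    var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };
'''

# Capture the {...} literal that follows "var filmData =" and ends before ";".
# The non-greedy match assumes the object contains no nested braces.
match = re.search(r'var\s+filmData\s*=\s*(\{.*?\})\s*;', script_text, re.DOTALL)
object_literal = match.group(1) if match else None

print(object_literal)
```

The captured text is still JavaScript, not a Python object, so it needs a further conversion step before the fields can be read by key.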

Here is the other way around:

...
script_text = str(soup.find('script', string=True).string)  # or however you locate the script block

# sample script text, used here in place of the real page
script_text = '    var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };'

# drop surrounding whitespace and the trailing ';'
script_text = script_text.strip()[:-1]
# substring starting after the 1st {
script_text = script_text[script_text.find('{') + 1:]

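# turn "key: value" into "key= value" so the pairs become keyword arguments
# (this assumes that no value contains a colon)
script_text = script_text.replace(':', '=')
# index just past the last closing }
i_close = len(script_text) - script_text[::-1].find('}')
# wrap the pairs (without the closing }) in a dict(...) call
script_text_d = 'dict(' + script_text[:i_close-1] + ')'
# evaluate the string (only do this with text you trust)
script_text_d = eval(script_text_d)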

print(script_text_d)
print(script_text_d['name'])

Output:

{'id': 51565, 'name': 'Dogville', 'gwiId': 39220, 'releaseYear': '2003', 'posterURL': '/film/dogville/image-150/', 'path': '/film/dogville/'}
Dogville
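If only name and releaseYear are needed, a lighter option is to skip the dictionary and pull each field out of the script text with its own regex. This is a minimal sketch, assuming both values are double-quoted strings without escaped quotes, as in the sample above. The variable names are illustrative.

```python
import re

# Sample script text copied from the question's output.
script_text = 'var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };'

# Each pattern captures the double-quoted value that follows the key.
name = re.search(r'\bname:\s*"([^"]*)"', script_text).group(1)
release_year = re.search(r'\breleaseYear:\s*"([^"]*)"', script_text).group(1)

print(name, release_year)  # Dogville 2003
```

This avoids eval entirely, but each pattern is tied to one key and to the quoting style used on the page.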

Remarks:

  • I chose to build the dictionary with the built-in dict function to avoid extra work
  • for json.loads, I guess you need to keep the string in the {} form, but then you need to double-quote all the key-like strings
  • or use a JavaScript parser
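The json.loads remark can be sketched as follows. It assumes the object literal has already been cut out of the script, that keys are bare identifiers, and that no string value contains a comma followed by something that looks like `word:`. The names `object_literal`, `json_text` and `film_data` are illustrative.

```python
import json
import re

# The {...} part of the script, copied from the question's output.
object_literal = '{ id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" }'

# Double-quote each bare key: an identifier that follows "{" or "," and precedes ":".
json_text = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', object_literal)

# With quoted keys the text is valid JSON and can be parsed without eval.
film_data = json.loads(json_text)

print(film_data['name'], film_data['releaseYear'])  # Dogville 2003
```

Compared with the dict(...) and eval approach, this keeps the parsing inside the json module, at the cost of a key-quoting regex that only holds for simple, flat objects like this one.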