Url = https://letterboxd.com/film/dogville/
I want to get the movie name and release year with BeautifulSoup.
import requests
from bs4 import BeautifulSoup
url = 'https://letterboxd.com/film/dogville/'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
soup.find_all("script")[10]
Output:
<script>
var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };
</script>
I managed to get the <script> block, but I don't know how to get name and releaseYear out of it.
How can I get them?
CodePudding user response:
The problem is that bs4 is not a JavaScript parser. Once you are inside the contents of a <script> tag you have reached its limit, and you need something else: a JavaScript parser. A weaker solution is to use the json module from the standard library to convert the dictionary-like string into a Python dictionary.
Once you have the string containing the JS code, you can either use a regex to extract the dictionary-like string, or go the other way around and trim the string down with plain string operations.
Here is the other way around:
...
script_text = str(soup.find("script", string=True).string) # or whatever selects the right <script> tag
# here is a sample string to work with
script_text = ' var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };'
script_text = script_text.strip()[:-1]
# substring starting after the 1st {
script_text = script_text[script_text.find('{') + 1:]
# turn "key: value" into "key=value" (this breaks if a value itself contains ':')
script_text = script_text.replace(':', '=')
# find index of the closing }
i_close = len(script_text) - script_text[::-1].find('}')
# wrap everything before the closing } in a dict(...) call
script_text_d = 'dict(' + script_text[:i_close-1] + ')'
# evaluate the string (eval executes arbitrary code, so only use it on text you trust)
script_text_d = eval(script_text_d)
print(script_text_d)
print(script_text_d['name'])
Output
{'id': 51565, 'name': 'Dogville', 'gwiId': 39220, 'releaseYear': '2003', 'posterURL': '/film/dogville/image-150/', 'path': '/film/dogville/'}
Dogville
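If only a couple of fields are needed, the regex route mentioned above can be sketched without eval. This is a minimal sketch, not a general JavaScript parser: the helper name extract_field is made up for illustration, and it assumes each wanted value is a double-quoted string with no escaped quotes inside, as in the sample from the question.

```python
import re

# Sample script text copied from the question; on a live page it would
# come from the <script> tag located with BeautifulSoup.
script_text = (
    'var filmData = { id: 51565, name: "Dogville", gwiId: 39220, '
    'releaseYear: "2003", posterURL: "/film/dogville/image-150/", '
    'path: "/film/dogville/" };'
)


def extract_field(js_text, key):
    """Return the double-quoted string value of `key`, or None if absent.

    Hypothetical helper: it only handles values written as "..." and
    does not understand escaped quotes or nested objects.
    """
    pattern = r'\b' + re.escape(key) + r'\s*:\s*"([^"]*)"'
    match = re.search(pattern, js_text)
    return match.group(1) if match else None


name = extract_field(script_text, "name")
release_year = extract_field(script_text, "releaseYear")
print(name, release_year)
```

Note that releaseYear comes back as a string, because that is how it is written in the page.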
Remarks:
- I chose to construct the dictionary with the built-in function dict to avoid extra work.
- For json.loads, I guess you need to keep the string in the form {...}, but then you need to put double quotes around every key.
- Use a JavaScript parser.
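The json.loads remark can be sketched roughly as follows. It assumes the object literal is flat, the keys are plain identifiers, and no string value contains something that looks like ", key:"; the regex used to quote the keys is an illustration under those assumptions, not a robust converter.

```python
import json
import re

# Sample script text copied from the question.
script_text = (
    'var filmData = { id: 51565, name: "Dogville", gwiId: 39220, '
    'releaseYear: "2003", posterURL: "/film/dogville/image-150/", '
    'path: "/film/dogville/" };'
)

# Keep the text from the first "{" to the last "}" inclusive.
object_literal = script_text[script_text.find("{"): script_text.rfind("}") + 1]

# Put double quotes around each bare key: an identifier that follows
# "{" or "," and is followed by ":".
json_text = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', object_literal)

film_data = json.loads(json_text)
print(film_data["name"], film_data["releaseYear"])
```

Unlike the eval version, this never executes the scraped text; json.loads raises an error if the result is not valid JSON.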