I am trying to get the horror movie scripts from IMSDb. The problem is that reaching a script takes two clicks: first on the name of the movie, and then on the link to that movie's script.
I already have the links for the first step, and my question is about the second. How do I follow a link? Should I use 'requests.get(url)', or is there another way to do it? I'm new to web scraping and would appreciate any guidance. Below is the code I use to get the links for the first step. I am using Google Colab.
import requests
from bs4 import BeautifulSoup
import lxml
website = 'https://imsdb.com/genre/Horror'
resultado = requests.get(website)
contenido = resultado.text
# contenido
soup = BeautifulSoup(contenido, 'lxml')
soup.prettify()
info = soup.find_all('td', {'valign':'top'})
# info
len(info)
info2 = info[-1]
# info2
lista = []
for link in info2.find_all('a'):
    aux = link.get('href')
    lista.append(aux)
# lista
CodePudding user response:
You can simply build the links to the scripts from the URLs in lista. Let's compare a regular URL with a script URL:
https://imsdb.com/Movie Scripts/Jurassic Park: The Lost World Script.html
https://imsdb.com/scripts/Jurassic-Park-The-Lost-World.html
We see that we need to:
- Replace the base url
- Remove colons
- Remove " Script"
- Replace spaces with hyphens (-)
So this should give you the links to scrape:
script_links = ['https://imsdb.com/scripts/' + i.split('/')[-1].replace(':', '').replace(' Script', '').replace(' ', '-') for i in lista]
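To actually follow each link, requests.get(url) works the same way as it did for the genre page. Below is a minimal sketch, not a definitive implementation. The helper names (to_script_url, fetch_script, fetch_all) are made up for illustration, and it assumes the script text sits in a td element with class "scrtext", which you should verify against the real page markup.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE = 'https://imsdb.com/scripts/'


def to_script_url(href):
    """Turn a movie-page href from lista into the matching script URL."""
    name = href.split('/')[-1]
    name = name.replace(':', '').replace(' Script', '').replace(' ', '-')
    return BASE + name


def fetch_script(url):
    """Download one script page and return its text, or None if unavailable."""
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    # Assumption: the script body is inside <td class="scrtext">.
    cell = soup.find('td', class_='scrtext')
    return cell.get_text() if cell else None


def fetch_all(hrefs, delay=1.0):
    """Fetch every script, pausing between requests to go easy on the server."""
    scripts = {}
    for href in hrefs:
        url = to_script_url(href)
        scripts[url] = fetch_script(url)
        time.sleep(delay)
    return scripts
```

Calling fetch_all(lista) would then return a dictionary mapping each script URL to its text. Entries are None where the page was missing or the assumed element was not found, so not every title is guaranteed to follow the naming pattern.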