How to follow links with web scraping?

Time:12-08

I am trying to get the horror movie scripts from IMSDb. The problem is that reaching a script takes two clicks: one on the name of the movie, and a second one on the link to that movie's script.

I already have the links from the first step, and my question follows from that. How do I follow a link? Should I do it with 'requests.get(url)', or is there another way? I'm new to web scraping and would appreciate some guidance toward my goal. Below is the code I wrote to get the links from the first step. I am using Google Colab.

import requests
from bs4 import BeautifulSoup
import lxml

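# Download the genre page that lists every horror movie
website = 'https://imsdb.com/genre/Horror'
resultado = requests.get(website)
contenido = resultado.text
# contenido

# Parse the HTML with the lxml parser
soup = BeautifulSoup(contenido, 'lxml')
soup.prettify()

# The movie list sits in the last top-aligned table cell
info = soup.find_all('td', {'valign':'top'})
# info

len(info)

info2 = info[-1]
# info2

# Collect the href of every movie link in that cell
lista = []
for link in info2.find_all('a'):
    aux = link.get('href')
    lista.append(aux)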

# lista

CodePudding user response:

You can build the script URLs directly from the URLs in lista. Compare a movie page URL with the URL of its script:

  • https://imsdb.com/Movie Scripts/Jurassic Park: The Lost World Script.html
  • https://imsdb.com/scripts/Jurassic-Park-The-Lost-World.html

We see that we need to:

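  1. Replace the base URL (everything before the file name) with https://imsdb.com/scripts/
  2. Remove colons
  3. Remove the word Script, together with the space before it
  4. Replace spaces with hyphens (-)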
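To make the four steps concrete, here is a minimal sketch that applies them one at a time to a single href. The function name script_url_from_href is made up for illustration, and it assumes every href ends in a file name shaped like the example above.

```python
def script_url_from_href(href):
    """Turn a movie page href into the URL of its script (hypothetical helper)."""
    # Step 1: keep only the file name, dropping the original base URL or path
    filename = href.split('/')[-1]
    # Step 2: remove colons
    filename = filename.replace(':', '')
    # Step 3: remove the word "Script" together with the space before it
    filename = filename.replace(' Script', '')
    # Step 4: replace the remaining spaces with hyphens
    filename = filename.replace(' ', '-')
    # Prepend the base URL used by the script pages
    return 'https://imsdb.com/scripts/' + filename


example = '/Movie Scripts/Jurassic Park: The Lost World Script.html'
print(script_url_from_href(example))
# prints https://imsdb.com/scripts/Jurassic-Park-The-Lost-World.html
```

Note that step 3 removes every occurrence of " Script", so a title that itself contains that word would need special handling.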

So this should give you the links to scrape:

script_links = ['https://imsdb.com/scripts/' + i.split('/')[-1].replace(':', '').replace(' Script', '').replace(' ','-') for i in lista]
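As for following the links: yes, requests.get(url) is enough, called once per URL. Below is a rough sketch of that loop. The fetch_pages helper, its get parameter (there only so the download function can be swapped out) and the one-second delay are assumptions for illustration, not something IMSDb requires.

```python
import time

import requests


def fetch_pages(urls, get=requests.get, delay=1.0):
    """Download each URL and return a dict {url: html_text}.

    Hypothetical helper: URLs that do not answer with HTTP 200 are skipped.
    """
    pages = {}
    for url in urls:
        response = get(url, timeout=30)
        if response.status_code == 200:
            pages[url] = response.text
        # Pause between requests to avoid hammering the site
        time.sleep(delay)
    return pages


# Usage, with script_links built as above:
# pages = fetch_pages(script_links)
```

Each value in pages is the raw HTML of one script page, so it can be parsed with BeautifulSoup the same way as the genre page in the question.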