Web Scrape with BS alternative method to .find_next()

Time:08-20

I am trying to scrape the movies from https://www.imdb.com/list/ls055592025/ with the code below. It works, but is there a way to write it without chaining multiple .find_next() calls?

import bs4
import requests

url = 'https://www.imdb.com/list/ls055592025/'

data = []

res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'lxml')

for e in soup.find_all(attrs={'class':"lister-item mode-detail"}):
    data.append({
        'movie_title': e.h3.a.get_text(strip=True),
        'release_date': e.h3.find_next('span').find_next('span').get_text(strip=True).strip('(').strip(')'),
        'movie_duration': e.p.find_next('span').find_next('span').find_next('span').get_text(strip=True),
        'movie_genre': e.p.find_next('span').find_next('span').find_next('span').find_next('span').find_next('span').get_text(strip=True)
    })

for d in data:
    print(d)

CodePudding user response:

Select the elements more specifically by their class to avoid chaining find_next(). In this case the class names look static rather than dynamically generated:

'movie_genre': e.select_one('.genre').get_text(strip=True)
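As a minimal sketch, the class-based lookup can be tried on a small inline snippet, so it runs without a network request. The markup is an assumption loosely modelled on one list item: the class names are the ones used in this answer, while the title, the values and the exact nesting are made up. html.parser is used here only to avoid the lxml dependency.

```python
import bs4

# Hypothetical markup for a single list item; the real page may differ.
html = """
<div class="lister-item mode-detail">
  <h3><span class="lister-item-index">1.</span>
      <a href="/title/tt0000001/">Example Movie</a>
      <span class="lister-item-year">(1999)</span></h3>
  <p><span class="certificate">PG</span>
     <span class="ghost">|</span>
     <span class="runtime">120 min</span>
     <span class="ghost">|</span>
     <span class="genre">
Drama, Comedy            </span></p>
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')
item = soup.select_one('.lister-item.mode-detail')

# Each field is addressed by its own class, so sibling order does not matter.
genre = item.select_one('.genre').get_text(strip=True)
runtime = item.select_one('.runtime').get_text(strip=True)
year = item.select_one('.lister-item-year').get_text(strip=True).strip('()')

print(genre, runtime, year)
```

Note that .strip('()') removes both kinds of parenthesis in one call, which is equivalent to .strip('(').strip(')') for a value like "(1999)".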

An alternative is to use pseudo-classes such as :nth-of-type() or :last-of-type, but be aware that this only works if the structure is always the same:

'release_date': e.h3.select_one('span:last-of-type').get_text(strip=True).strip('(').strip(')'),
'movie_duration': e.select_one('p span:nth-of-type(3)').get_text(strip=True),
'movie_genre': e.select_one('p span:last-of-type').get_text(strip=True)
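To illustrate why the structure has to stay the same, here is a small sketch with two hypothetical fragments, one of which has no certificate span. The fragments and the helper name pick are assumptions made for this illustration, not taken from the real page.

```python
import bs4

# Hypothetical fragments: the second one has no certificate span,
# so every following span moves up by two positions.
with_cert = ('<p><span class="certificate">PG</span><span class="ghost">|</span>'
             '<span class="runtime">120 min</span><span class="ghost">|</span>'
             '<span class="genre">Drama</span></p>')
without_cert = ('<p><span class="runtime">95 min</span><span class="ghost">|</span>'
                '<span class="genre">Horror</span></p>')

def pick(html, selector):
    """Parse the fragment and return the stripped text of the first match."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    return soup.select_one(selector).get_text(strip=True)

fragments = (with_cert, without_cert)
by_position = [pick(h, 'p span:nth-of-type(3)') for h in fragments]
by_class = [pick(h, '.runtime') for h in fragments]

print(by_position)  # ['120 min', 'Horror']  -> the second value is the genre
print(by_class)     # ['120 min', '95 min']
```

The positional selector silently returns the wrong field for the second fragment, while the class selector keeps working.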
Example
import bs4
import requests
url = 'https://www.imdb.com/list/ls055592025/'

data = []
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'lxml')

for e in soup.find_all(attrs={'class':"lister-item mode-detail"}):
    data.append({
        'movie_title': e.h3.a.get_text(strip=True),
        'release_date': e.select_one('.lister-item-year').get_text(strip=True).strip('(').strip(')'),
        'movie_duration': e.select_one('.runtime').get_text(strip=True),
        'movie_genre': e.select_one('.genre').get_text(strip=True)
    })

print(data)
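One caveat: select_one() returns None when nothing matches, so calling .get_text() directly raises an AttributeError if an item lacks one of the fields. A possible guard is sketched below. The helper name text_or_none and the inline markup are hypothetical, and whether any item on the real list is missing a field is not something checked here.

```python
import bs4

def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

# Hypothetical item without a runtime span.
html = ('<div class="lister-item mode-detail"><h3><a>Example Movie</a></h3>'
        '<p><span class="genre">Drama</span></p></div>')
soup = bs4.BeautifulSoup(html, 'html.parser')
e = soup.select_one('.lister-item.mode-detail')

row = {
    'movie_title': text_or_none(e, 'h3 a'),
    'movie_duration': text_or_none(e, '.runtime'),  # missing, so None
    'movie_genre': text_or_none(e, '.genre'),
}
print(row)
```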