How to extract the actors in a show from IMDb using BeautifulSoup?

Time:06-16

I am trying to extract the cast list of The Office by using BeautifulSoup to scrape this IMDb page: https://www.imdb.com/title/tt0386676/fullcredits/?ref_=tt_ql_cl.

actors = soup.findAll('table',{'cast_list'})

How would I change this so it only gives me the actor's name? An example of the HTML is:

<td> <a href="/name/nm0933988/?ref_=ttfc_fc_cl_t1"> Rainn Wilson </a> </td>

And I would like to only extract the text 'Rainn Wilson'.

Any help is appreciated. It's my first question here, so please go easy on me.

CodePudding user response:

Try this:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0"
}
url = "https://www.imdb.com/title/tt0386676/fullcredits/?ref_=tt_ql_cl"

actors = (
    BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
    .find('table', class_='cast_list')
    .select_one("a img")["title"]
)
print(actors)

Output:

Rainn Wilson
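
Note that select_one() returns only the first match, which is why a single name is printed. To collect every name the same way, select() can be used instead. Below is a minimal sketch, assuming each cast row wraps its photo in an <a><img title="..."></a> like the first row does. It runs on a small inline sample rather than the live page; the sample markup, the second name, its placeholder href, and the helper name actor_names_from_images are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely modeled on the snippet in the question.
# The live IMDb page may be structured differently.
SAMPLE_HTML = """
<table class="cast_list">
  <tr>
    <td><a href="/name/nm0933988/"><img title="Rainn Wilson" alt="Rainn Wilson"></a></td>
    <td><a href="/name/nm0933988/"> Rainn Wilson </a></td>
  </tr>
  <tr>
    <td><a href="/name/nm0000000/"><img title="Example Actor" alt="Example Actor"></a></td>
    <td><a href="/name/nm0000000/"> Example Actor </a></td>
  </tr>
</table>
"""


def actor_names_from_images(html):
    """Return the title attribute of every linked image in the cast table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="cast_list")
    if table is None:
        # find() returns None when the table is missing, e.g. after a redesign.
        return []
    # select() returns all matches; [title] skips images without a title.
    return [img["title"] for img in table.select("a img[title]")]


names = actor_names_from_images(SAMPLE_HTML)
print(names)
```

The None check matters on the live page too: if the request is blocked or the markup changes, find() returns None and the chained call in the answer above would raise an AttributeError.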

CodePudding user response:

You can get all the actors from that page as follows:

import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/title/tt0386676/fullcredits/?ref_=tt_ql_cl"
req = requests.get(url)
soup = BeautifulSoup(req.content, "lxml")
table_actors = soup.find("table", class_="simpleCreditsTable")

for td_actor in table_actors.find_all("td", class_="name"):
    print(td_actor.a.get_text(strip=True))

This locates the first table with the class simpleCreditsTable and then finds all of the <td> elements with the class name. For each one, it gets the text of the first <a> tag inside it.

This would give you output starting:

Paul Feig
Randall Einhorn
Ken Kwapis
Greg Daniels
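
The names above look like crew rather than cast, which is what you would expect if find() is picking up the first simpleCreditsTable on the page. To target the cast specifically, the same idea can be pointed at the cast_list table from the question. This is only a sketch under assumptions: it runs on inline sample HTML modeled on the question's snippet, and the second row, its placeholder href, and the helper name cast_names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup based on the <td><a href="/name/...">...</a></td>
# pattern shown in the question; the live page may differ.
SAMPLE_HTML = """
<table class="cast_list">
  <tr>
    <td><a href="/name/nm0933988/?ref_=ttfc_fc_cl_i1"><img alt="Rainn Wilson"></a></td>
    <td> <a href="/name/nm0933988/?ref_=ttfc_fc_cl_t1"> Rainn Wilson </a> </td>
  </tr>
  <tr>
    <td><a href="/name/nm0000000/"><img alt="Example Actor"></a></td>
    <td> <a href="/name/nm0000000/"> Example Actor </a> </td>
  </tr>
</table>
"""


def cast_names(html):
    """Return the link text of every /name/ link in the cast table, in order."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="cast_list")
    if table is None:
        return []
    names = []
    for link in table.select('td a[href^="/name/"]'):
        text = link.get_text(strip=True)
        # Photo links contain only an <img>, so their text is empty; skip them.
        if text and text not in names:
            names.append(text)
    return names


cast = cast_names(SAMPLE_HTML)
print(cast)
```

Filtering on the href prefix avoids depending on a class for the name cell, which the HTML in the question does not have.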