I am trying to extract the cast list of The Office by using BeautifulSoup to scrape this IMDb page: https://www.imdb.com/title/tt0386676/fullcredits/?ref_=tt_ql_cl. This is what I have so far:
actors = soup.find_all('table', class_='cast_list')
How would I change this so it only gives me the actor's name? An example of the HTML is:
<td> <a href="/name/nm0933988/?ref_=ttfc_fc_cl_t1"> Rainn Wilson </a> </td>
And I would like to only extract the text 'Rainn Wilson'.
Any help is appreciated. This is my first question here, so please go easy on me.
CodePudding user response:
Try this:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0"
}
url = "https://www.imdb.com/title/tt0386676/fullcredits/?ref_=tt_ql_cl"
actors = (
    BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
    .find('table', class_='cast_list')
    .select_one("a img")["title"]
)
print(actors)
Output:
Rainn Wilson
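Note that select_one returns only the first match, so the snippet above prints a single name. Here is a minimal sketch of collecting every name instead. It runs against a small hand-written HTML sample rather than the live page, so the sample markup, the placeholder second row, and the helper name extract_names are all assumptions for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hand-written sample modelled on the <td><a> snippet in the question.
# The second row is a made-up placeholder, not real IMDb data.
SAMPLE_HTML = """
<table class="cast_list">
  <tr>
    <td><a href="/name/nm0933988/?ref_=ttfc_fc_cl_t1"><img title="Rainn Wilson"></a></td>
    <td> <a href="/name/nm0933988/?ref_=ttfc_fc_cl_t1"> Rainn Wilson </a> </td>
  </tr>
  <tr>
    <td> <a href="/name/nm0000000/"> Example Actor </a> </td>
  </tr>
</table>
"""


def extract_names(html):
    """Return the visible text of every /name/ link in the cast_list table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="cast_list")
    if table is None:
        # The table is missing (layout changed or the request was blocked).
        return []
    names = []
    # Keep only links whose href starts with /name/.
    for link in table.select('td a[href^="/name/"]'):
        text = link.get_text(strip=True)
        if text:  # photo links contain only an <img>, so their text is empty
            names.append(text)
    return names


print(extract_names(SAMPLE_HTML))
```

To try this on the real page, pass the text of the requests.get(...) response from the snippet above to extract_names, and check for an empty list in case the table was not found.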
CodePudding user response:
You can get all the actors from that page as follows:
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/title/tt0386676/fullcredits/?ref_=tt_ql_cl"
req = requests.get(url)
soup = BeautifulSoup(req.content, "lxml")
table_actors = soup.find("table", class_="simpleCreditsTable")
for td_actor in table_actors.find_all("td", class_="name"):
    print(td_actor.a.get_text(strip=True))
This first locates the table holding the actors and then finds all of the <td> elements with the class name. For each of those elements, it then gets the text inside the first <a> tag it contains.
This would give you output starting:
Paul Feig
Randall Einhorn
Ken Kwapis
Greg Daniels