Let's say I want to scrape IMDb for the top 10 movies. I would like to fetch the title and the cast members of each movie.
I can easily fetch the titles and append them to a list. The problem is that I don't know how to append several values to a single row. Let's say the first movie has 3 actors and the second movie has 5 actors: how can I append the actors to a list so that the 3 actors of the first movie are in row 1 of the list, the 5 actors of the second movie are in row 2, and so on?
CodePudding user response:
Just a general approach, because there is no code provided in your question.
Request the website (for example, the top 250 movies) and cook your soup:
import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.imdb.com/chart/top')
soup = BeautifulSoup(response.text, 'lxml')
Create an empty list to store your results:
data = []
Iterate over the result set of your selection (in this example, the top 250 movies) and append one dict per iteration to your list:
for e in soup.select('.titleColumn a'):
    # The link's title attribute looks like 'Director (dir.), Actor 1, Actor 2'
    data.append({
        'title': e.text,
        'director': e['title'].split('(dir.),')[0].strip(),
        'actors': e['title'].split('(dir.),')[-1].strip()
    })
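If you want the actors of each movie as their own row, as asked in the question, one option is to split the cast string into a list and collect one list per movie. The following is only a minimal sketch: `split_credits` is a hypothetical helper, and the sample strings stand in for the `e['title']` values from the loop, assuming they follow the 'Director (dir.), Actor, Actor' pattern.

```python
def split_credits(title_attr):
    """Split 'Director (dir.), Actor 1, Actor 2' into (director, [actors])."""
    director, _, cast = title_attr.partition('(dir.),')
    # Drop surrounding whitespace and skip empty pieces.
    actors = [name.strip() for name in cast.split(',') if name.strip()]
    return director.strip(), actors


# Hypothetical sample values standing in for e['title'] in the loop above.
samples = [
    'Frank Darabont (dir.), Tim Robbins, Morgan Freeman',
    'Director X (dir.), Actor 1, Actor 2, Actor 3',
]

# One inner list per movie; the inner lists may have different lengths.
cast_rows = [split_credits(s)[1] for s in samples]
print(cast_rows)
```

Inside the scraping loop you could store the list directly in the dict, for example 'actors': split_credits(e['title'])[1], so that each entry in data carries all actors of one movie.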
Print your data, or create a DataFrame from your list of dicts:
import pandas as pd
pd.DataFrame(data)
Output:
   title             director              actors
0  Die Verurteilten  Frank Darabont        Tim Robbins, Morgan Freeman
1  Der Pate          Francis Ford Coppola  Marlon Brando, Al Pacino
2  Der Pate 2        Francis Ford Coppola  Al Pacino, Robert De Niro
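If rows with a varying number of actors are awkward to work with, an alternative is a long layout with one (title, actor) pair per row. This is just a sketch in plain Python; the data list below is a hand-written stand-in that assumes the actors are stored as lists rather than as one comma-separated string.

```python
# Hand-written stand-in for the scraped list of dicts (actors stored as lists).
data = [
    {'title': 'Die Verurteilten', 'actors': ['Tim Robbins', 'Morgan Freeman']},
    {'title': 'Der Pate', 'actors': ['Marlon Brando', 'Al Pacino']},
]

# Repeat the title once per actor so every row holds exactly one actor.
long_rows = [
    {'title': movie['title'], 'actor': actor}
    for movie in data
    for actor in movie['actors']
]

for row in long_rows:
    print(row['title'], '|', row['actor'])
```

With pandas, DataFrame.explode('actors') on a column that holds lists gives a comparable one-actor-per-row result.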