I have scraped data from a website and would like to save all of it. However, only the last value is saved. I created an empty dictionary, but I'm struggling to add elements to it.
Here's my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy
try:
    source = requests.get('https://www.imdb.com/chart/top/')
    source.raise_for_status()
    soup = BeautifulSoup(source.text,'html.parser')
    movies = soup.find('tbody', class_="lister-list").find_all('tr')
    data = {}
    for movie in movies:
        name = movie.find('td', class_='titleColumn').a.text
        rank = movie.find('td', class_="titleColumn").get_text(strip=True).split('.')[0]
        year = movie.find('td', class_="titleColumn").span.text.strip('()')
        rating = movie.find('td', class_="ratingColumn imdbRating").strong.text
except Exception as e:
    print(e)
print(data)
CodePudding user response:
You're close to your goal. Make data a list instead of a dict, build a dict with the information for each movie, and append it to the list on each iteration. You can then create a DataFrame from that list:
for movie in movies:
    data.append({
        'name': movie.find('td', class_='titleColumn').a.text,
        'rank': movie.find('td', class_="titleColumn").get_text(strip=True).split('.')[0],
        'year': movie.find('td', class_="titleColumn").span.text.strip('()'),
        'rating': movie.find('td', class_="ratingColumn imdbRating").strong.text
    })
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
source = requests.get('https://www.imdb.com/chart/top/')
source.raise_for_status()
soup = BeautifulSoup(source.text,'html.parser')
movies = soup.find('tbody', class_="lister-list").find_all('tr')
data = []
for movie in movies:
    data.append({
        'name': movie.find('td', class_='titleColumn').a.text,
        'rank': movie.find('td', class_="titleColumn").get_text(strip=True).split('.')[0],
        'year': movie.find('td', class_="titleColumn").span.text.strip('()'),
        'rating': movie.find('td', class_="ratingColumn imdbRating").strong.text
    })
pd.DataFrame(data)
Output
| | name | rank | year | rating |
|---|---|---|---|---|
| 0 | Die Verurteilten | 1 | 1994 | 9.2 |
| 1 | Der Pate | 2 | 1972 | 9.2 |
| 2 | The Dark Knight | 3 | 2008 | 9 |
| 3 | Der Pate 2 | 4 | 1974 | 9 |
| 4 | Die zwölf Geschworenen | 5 | 1957 | 8.9 |
...
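To see why the original loop kept only the last movie, and why appending fixes it, here is a minimal network-free sketch. The `rows` list is a hypothetical stand-in for the scraped `tr` elements (the values are copied from the output table), and `last_seen` is just an illustrative name:

```python
# Hypothetical sample rows standing in for the scraped <tr> elements.
rows = [
    ("Die Verurteilten", "1", "1994", "9.2"),
    ("Der Pate", "2", "1972", "9.2"),
    ("The Dark Knight", "3", "2008", "9"),
]

# Original pattern: the loop variables are rebound on every pass
# and nothing is stored, so only the final row is left afterwards.
for name, rank, year, rating in rows:
    pass
last_seen = name

# Fixed pattern: build one dict per row and append it to a list.
data = []
for name, rank, year, rating in rows:
    data.append({"name": name, "rank": rank, "year": year, "rating": rating})

print(last_seen)  # The Dark Knight
print(len(data))  # 3
```

Each appended dict is a separate object, so earlier rows are no longer overwritten by later ones.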
CodePudding user response:
You can replace your for loop with this one to build nested dictionaries, so that you can look up a movie by name and then pick the field you want from it:
for movie in movies:
    name = movie.find('td', class_='titleColumn').a.text
    data[name] = {}
    rank = movie.find('td', class_="titleColumn").get_text(strip=True).split('.')[0]
    year = movie.find('td', class_="titleColumn").span.text.strip('()')
    rating = movie.find('td', class_="ratingColumn imdbRating").strong.text
    data[name]["rank"] = rank
    data[name]["year"] = year
    data[name]["rating"] = rating
print(data)
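As a rough illustration of the lookup this enables, the sketch below uses a hand-written excerpt of the nested dictionary rather than live scraped data. The sample entries and the title "Some Unlisted Title" are assumptions made for the example:

```python
# Hypothetical excerpt of the nested dictionary the loop would build.
data = {
    "Der Pate": {"rank": "2", "year": "1972", "rating": "9.2"},
    "The Dark Knight": {"rank": "3", "year": "2008", "rating": "9"},
}

# Direct lookup: first by movie name, then by field.
year = data["The Dark Knight"]["year"]

# dict.get() returns None instead of raising KeyError for a missing title.
missing = data.get("Some Unlisted Title")

print(year)     # 2008
print(missing)  # None
```

One thing to keep in mind with this layout: dictionary keys are unique, so if two movies shared the same title, the later one would replace the earlier one.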
CodePudding user response:
I would suggest storing the current movie in data, using the name of the movie as the key:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy
# created before the try block so print(data) still works if the request fails
data = {}
try:
    source = requests.get('https://www.imdb.com/chart/top/')
    source.raise_for_status()
    soup = BeautifulSoup(source.text,'html.parser')
    movies = soup.find('tbody', class_="lister-list").find_all('tr')
    for movie in movies:
        name = movie.find('td', class_='titleColumn').a.text
        rank = movie.find('td', class_="titleColumn").get_text(strip=True).split('.')[0]
        year = movie.find('td', class_="titleColumn").span.text.strip('()')
        rating = movie.find('td', class_="ratingColumn imdbRating").strong.text
        cur = {
            'name': name,
            'rank': rank,
            'year': year,
            'rating': rating
        }
        # storing the cur movie in data but name of the movie as a key
        data[name] = cur
except Exception as e:
    print(e)
print(data)
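Because each stored value also carries the name, the dictionary can be flattened back into a list of records when a table-like structure is needed. This is only a sketch with hand-written sample entries; `records` is a hypothetical name, and sorting by rank is an optional extra step:

```python
# Hypothetical excerpt of a dictionary keyed by movie name,
# where each value also repeats the name.
data = {
    "Der Pate": {"name": "Der Pate", "rank": "2", "year": "1972", "rating": "9.2"},
    "Die Verurteilten": {"name": "Die Verurteilten", "rank": "1", "year": "1994", "rating": "9.2"},
}

# The values are already complete records, so a flat list is one call away.
records = list(data.values())

# The scraped rank is a string, so convert it to int for numeric ordering.
records.sort(key=lambda movie: int(movie["rank"]))

print([movie["name"] for movie in records])
```

A list like `records` has the same shape as the list of dicts passed to `pd.DataFrame` earlier in this thread.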