I am very new to web scraping in Python. My code runs without errors and the output looks correct, except for the language it is printed in. I tried my hand at IMDb, the popular movie website. I inspected the HTML and want to extract the name of each movie, its rating, and so on. This is the IMDb page with the top 250 movies and their ratings: https://www.imdb.com/chart/top/ My code to scrape the data is as follows; I use the BeautifulSoup and requests modules.
import requests
from bs4 import BeautifulSoup

# Use the requests module to fetch the IMDb Top 250 page
source = requests.get('https://www.imdb.com/chart/top/')
# Raise an exception if the server returned an HTTP error status (4xx or 5xx)
source.raise_for_status()
# Parse the returned HTML with Python's built-in html.parser
soup = BeautifulSoup(source.text, 'html.parser')
movies = soup.find('tbody', class_='lister-list').find_all('tr')
#print(len(movies))
# Iterate through each tr tag
for movie in movies:
    # Use break to check only the first element of the list
    #break
    name = movie.find('td', class_='titleColumn').a.text
    rank = movie.find('td', class_='titleColumn').get_text(strip=True).split('.')[0]
    year = movie.find('td', class_='titleColumn').span.text.strip('()')
    rating = movie.find('td', class_="ratingColumn imdbRating").strong.text
    print(name, rank, year, rating)
Everything on the website is in English, so why is my output in a foreign language?
The output is the following:
刺激1995 1 1994 9.2
教父 2 1972 9.2
黑暗騎士 3 2008 9.0
教父第二集 4 1974 9.0
十二怒漢 5 1957 8.9
辛德勒的名單 6 1993 8.9
魔戒三部曲:王者再臨 7 2003 8.9
黑色追緝令 8 1994 8.9
魔戒首部曲:魔戒現身 9 2001 8.8
黃昏三鏢客 10 1966 8.8
阿甘正傳 11 1994 8.8
鬥陣俱樂部 12 1999 8.7
全面啟動 13 2010 8.7
魔戒二部曲:雙城奇謀 14 2002 8.7
星際大戰五部曲:帝國大反擊 15 1980 8.7
駭客任務 16 1999 8.7
四海好傢伙 17 1990 8.7
飛越杜鵑窩 18 1975 8.6
火線追緝令 19 1995 8.6
七武士 20 1954 8.6
風雲人物 21 1946 8.6
沉默的羔羊 22 1991 8.6
CodePudding user response:
I assume that your IP address is located in China? There is a chance that IMDb uses geolocation and sets your language to Chinese.
You have the same problem as the person in the question below, and I think the same answer applies: add a header to your request and set the language to English.
Python change Accept-Language using requests
CodePudding user response:
You can add Accept-Language to your headers before making the request.
headers = {'Accept-Language': 'en-US,en;q=0.5'}
source = requests.get('https://www.imdb.com/chart/top/', headers=headers)
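To make the header value above less opaque, here is a rough sketch of how a value such as 'en-US,en;q=0.5' can be read: each comma-separated entry is a language tag, and an optional q parameter is a preference weight that defaults to 1.0. Note that parse_accept_language is a hypothetical helper written only for illustration; it is not part of requests, and it is a simplification rather than a full parser of the HTTP header grammar.

```python
def parse_accept_language(value):
    """Return (language, quality) pairs sorted by descending quality.

    Hypothetical helper for illustration only; not part of requests.
    """
    preferences = []
    for item in value.split(','):
        item = item.strip()
        if not item:
            continue
        # Each entry is a language tag, optionally followed by ";q=<weight>".
        tag, _, params = item.partition(';')
        quality = 1.0  # the weight defaults to 1.0 when no q parameter is given
        params = params.strip()
        if params.startswith('q='):
            quality = float(params[2:])
        preferences.append((tag.strip(), quality))
    # sorted() is stable, so entries with equal weight keep their original order.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


headers = {'Accept-Language': 'en-US,en;q=0.5'}
print(parse_accept_language(headers['Accept-Language']))
# → [('en-US', 1.0), ('en', 0.5)]
```

In other words, the header asks for US English first and any other English variant as a fallback. Whether a site honors the header is up to the server.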