Web scraping returns foreign language although everything is in English

Time:06-21

I am very new to web scraping in Python. My code runs without errors and the output seems correct, but the problem is the language of the output. I tried my hand at IMDb, the popular website. I inspected the HTML because I want to extract each movie's name, rating, etc. This is the IMDb page with the top 250 movies and their ratings: https://www.imdb.com/chart/top/ My code to scrape the data is as follows; I use the BeautifulSoup and requests modules.

import requests
from bs4 import BeautifulSoup

# Use the requests module to fetch the IMDb Top 250 page
source = requests.get('https://www.imdb.com/chart/top/')
# Raise an error if the request failed (e.g. a bad URL or an HTTP error status)
source.raise_for_status()
# Parse the returned HTML
soup = BeautifulSoup(source.text, 'html.parser')
movies = soup.find('tbody', class_='lister-list').find_all('tr')
#print(len(movies))
# Iterate through each tr tag
for movie in movies:
    # Use break to check only the first element of the list
    #break
    name = movie.find('td', class_='titleColumn').a.text
    rank = movie.find('td', class_='titleColumn').get_text(strip=True).split('.')[0]
    year = movie.find('td', class_='titleColumn').span.text.strip('()')
    rating = movie.find('td', class_='ratingColumn imdbRating').strong.text
    print(name, rank, year, rating)

Everything on the website is in English, so why is my output in a foreign language?

The output is the following

刺激1995 1 1994 9.2
教父 2 1972 9.2
黑暗騎士 3 2008 9.0
教父第二集 4 1974 9.0
十二怒漢 5 1957 8.9
辛德勒的名單 6 1993 8.9
魔戒三部曲:王者再臨 7 2003 8.9
黑色追緝令 8 1994 8.9
魔戒首部曲:魔戒現身 9 2001 8.8
黃昏三鏢客 10 1966 8.8
阿甘正傳 11 1994 8.8
鬥陣俱樂部 12 1999 8.7
全面啟動 13 2010 8.7
魔戒二部曲:雙城奇謀 14 2002 8.7
星際大戰五部曲:帝國大反擊 15 1980 8.7
駭客任務 16 1999 8.7
四海好傢伙 17 1990 8.7
飛越杜鵑窩 18 1975 8.6
火線追緝令 19 1995 8.6
七武士 20 1954 8.6
風雲人物 21 1946 8.6
沉默的羔羊 22 1991 8.6

CodePudding user response:

I assume that your IP address is located in China? There is a chance that IMDb uses geolocation and sets your language to Chinese.

You have the same problem as the person in the question linked below, and I think the same answer applies. Add a header to your request and set the language to English.

Python change Accept-Language using requests

CodePudding user response:

You can add an Accept-Language header to your request before sending it.

headers = {'Accept-Language': 'en-US,en;q=0.5'}

source = requests.get('https://www.imdb.com/chart/top/', headers=headers)
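To make the header value above less opaque, here is a minimal sketch of how a string like 'en-US,en;q=0.5' is put together: a comma-separated list of language tags, each optionally followed by a `q` weight between 0 and 1 (a missing weight means 1). The helper name `build_accept_language` is hypothetical and not part of requests; it only illustrates the format.

```python
def build_accept_language(preferences):
    """Build an Accept-Language value from (language_tag, weight) pairs.

    Hypothetical helper for illustration only. A weight of 1.0 is the
    default, so it is left implicit (e.g. plain 'en-US').
    """
    parts = []
    for tag, weight in preferences:
        if weight >= 1.0:
            parts.append(tag)
        else:
            # ':g' drops trailing zeros, so 0.5 is written as "0.5"
            parts.append(f"{tag};q={weight:g}")
    return ",".join(parts)


# Prefer US English, then fall back to any English with a lower weight.
headers = {"Accept-Language": build_accept_language([("en-US", 1.0), ("en", 0.5)])}
print(headers)
```

The resulting `headers` dict is the same as the one in the answer above, so it can be passed unchanged as `headers=headers` to `requests.get`. Whether the server honors the header is up to the site.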