Why is text inside HTML tags getting translated when requested while Web Scraping?

Time:08-17

I am learning a bit about web scraping and am currently working on a small project. With this code I store the page's HTML in the soup variable.

source = requests.get(URL)
soup = BeautifulSoup(source.text, 'html.parser')

The problem is: when I inspect the code inside my browser it looks like this:

<a ...>The Godfather</a>

but when I fetch it in my program, the text inside the tag (The Godfather) comes back translated into my native language (Кум):

<a ...>Кум</a>

I don't want it to be translated. My browser is set entirely to English, and I have no idea why this is happening. Any help would be much appreciated!

CodePudding user response:

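Try specifying the Accept-Language HTTP header in your request. requests does not send this header by default, so the site is probably choosing the language some other way, for example from your IP address: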

import requests
from bs4 import BeautifulSoup


url = "https://www.imdb.com/search/title/?groups=top_100&sort=user_rating,desc"

headers = {"Accept-Language": "en-US,en;q=0.5"}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")


for h3 in soup.select("h3"):
    print(h3.get_text(strip=True, separator=" "))

Prints:

1. The Shawshank Redemption (1994)
2. The Godfather (1972)
3. The Dark Knight (2008)
4. The Lord of the Rings: The Return of the King (2003)
5. Schindler's List (1993)
6. The Godfather Part II (1974)

...
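
The value "en-US,en;q=0.5" used above is a comma-separated list of language ranges, each with an optional q weight between 0 and 1 (a range without a q counts as 1). As a rough sketch of how a server might rank those preferences, here is a small parser. The function name parse_accept_language is made up for this illustration, and real servers (IMDb included) may negotiate the language differently:

```python
def parse_accept_language(header):
    """Return (language, weight) pairs sorted from most to least preferred.

    Hypothetical helper for illustration only; not part of requests or bs4.
    """
    preferences = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        lang, _, params = part.partition(";")
        weight = 1.0  # a range with no q parameter defaults to weight 1
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # assumption: treat a malformed weight as lowest priority
        preferences.append((lang.strip(), weight))
    # sorted() is stable, so ranges with equal weights keep their original order
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


print(parse_accept_language("en-US,en;q=0.5"))
# [('en-US', 1.0), ('en', 0.5)]
```

So the header in the answer asks for US English first and any other English variant as a fallback. Since no other language is listed, a server that honors the header has no reason to return localized titles.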