I'm trying to scrape some German sentences from Glosbe.com. The requested URL contains non-ASCII (UTF-8) characters, which I percent-encode ("quote") before sending the request. The website doesn't convert the quoted characters back to UTF-8 characters once the request is made. The requested URL should look like this:
https://glosbe.com/de/hu/abkühlen
But the URL the website receives is not converted back to UTF-8, and the word it searches for comes from this URL:
https://glosbe.com/de/hu/abkühlen/
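For reference, the quoting step can be checked locally without contacting the website. This is a minimal sketch using only the standard library; it shows what `urllib.parse.quote` produces for the word and that `unquote` reverses it. The variable names are illustrative.

```python
from urllib.parse import quote, unquote

word = "abkühlen"

# quote() percent-encodes the UTF-8 bytes of non-ASCII characters:
# "ü" is the two bytes 0xC3 0xBC, so it becomes "%C3%BC".
encoded = quote(word)

# unquote() reverses the percent-encoding (UTF-8 is the default).
decoded = unquote(encoded)

print(encoded)  # abk%C3%BChlen
print(decoded)  # abkühlen
```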
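The code I used:
import urllib.parse
import requests
from bs4 import BeautifulSoup

def beautifulSoapPrepare(sourceLang, destLang, phrase):
    headers = {
        'User-Agent': 'My User Agent 1.0',
        'From': '[email protected]'  # This is another valid field
    }
    # Percent-encode the phrase; note the "/" appended at the very end of the URL
    url = "https://glosbe.com/" + sourceLang + "/" + destLang + "/" + urllib.parse.quote(phrase) + "/"
    # The parser name belongs to BeautifulSoup, not to requests.get
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, features="lxml")
    return soup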
The picture here shows the problem. [Image: the problem in picture]
Could you please help me solve this issue? I want the website to search for the German word abkühlen and not for its quoted (percent-encoded) form.
Solution: The problem was in the URL. Once I deleted the slash at the end of the URL, it worked.
Before:
url = "https://glosbe.com/" + sourceLang + "/" + destLang + "/" + urllib.parse.quote(phrase) + "/"
After:
url = "https://glosbe.com/" + sourceLang + "/" + destLang + "/" + urllib.parse.quote(phrase)
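The fix can also be wrapped in a small helper so the trailing slash cannot creep back in. This is only a sketch: `build_glosbe_url` and `BASE_URL` are hypothetical names, and passing `safe=""` (so that a `/` inside the phrase is encoded too) is an assumption that goes beyond the original code, which used the default.

```python
from urllib.parse import quote

BASE_URL = "https://glosbe.com"  # hypothetical constant, for illustration only


def build_glosbe_url(source_lang: str, dest_lang: str, phrase: str) -> str:
    """Build a lookup URL with a percent-encoded phrase and no trailing slash."""
    # safe="" also encodes "/" inside the phrase (assumption, see lead-in).
    path = "/".join([source_lang, dest_lang, quote(phrase, safe="")])
    return f"{BASE_URL}/{path}"


print(build_glosbe_url("de", "hu", "abkühlen"))
# https://glosbe.com/de/hu/abk%C3%BChlen
```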
User response:
Given that your ultimate goal is to obtain the translation(s) of the particular word you're looking for, the following code will give you just that (you can later wrap it in a class or a function, as you prefer):
import requests
from bs4 import BeautifulSoup as bs

url = 'https://glosbe.com/de/hu/'
word = 'abkühlen'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
# Append the word without a trailing slash
r = requests.get(url + word, headers=headers)
soup = bs(r.text, 'html.parser')
# Select the <h3 class="translation"> elements and print their text
translations = soup.select('h3.translation')
for t in translations:
    print(t.get_text(strip=True))
The result printed in the terminal:
lehűl
hűtés
lehűt
hűvös
hűtés
előhűtés
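Note that hűtés appears twice in the output above. If only distinct translations are wanted, one option is to de-duplicate while keeping the page order. This is a minimal sketch that uses the printed output as sample data; the variable names are illustrative and not part of the original code.

```python
# Sample data copied from the output above; in the real script this would be
# [t.get_text(strip=True) for t in translations].
texts = ["lehűl", "hűtés", "lehűt", "hűvös", "hűtés", "előhűtés"]

# dict.fromkeys keeps only the first occurrence of each item, in insertion order.
unique_texts = list(dict.fromkeys(texts))

for text in unique_texts:
    print(text)
```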
Requests documentation can be found at https://requests.readthedocs.io/en/latest/
Also, BeautifulSoup docs are at: https://beautiful-soup-4.readthedocs.io/en/latest/index.html