I am trying to extract the 'meanings' section of a dictionary entry from an HTML file using BeautifulSoup, but it is giving me some trouble. Here is a summary of what I have tried so far:
- I right click on the dictionary entry page below and save the webpage to my Python directory as 'aufmachen.html'
https://www.duden.de/rechtschreibung/aufmachen
- Within the source code of this webpage, the section that I am trying to extract starts from line 1042 with the expression
- I wrote the code below but neither tags nor Bedeutungen contains any search results.
import requests
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
with open("aufmachen.html", encoding="utf8") as f:
    doc = BeautifulSoup(f, "html.parser")
tags = doc.body.findAll(text = '<div id="bedeutungen">')
print(tags)
Bedeutungen = doc.body.findAll("div", {"id": "bedeutungen"})
print(Bedeutungen)
Could you please help me with this problem?
Thanks for your time in advance.
CodePudding user response:
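One thing to try is to pass BeautifulSoup a string rather than the file object: call .read() on your file to get a string. (BeautifulSoup does also accept an open file handle, so this alone may not be the cause.)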
with open("aufmachen.html", "r", encoding="utf8") as f:
    doc = BeautifulSoup(f.read(), "html.parser")
However, it seems you want to pull in the HTML from a URL rather than from a file on your computer. This can be done like this:
from bs4 import BeautifulSoup
import requests
url = "https://www.duden.de/rechtschreibung/aufmachen"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
Bedeutungen = soup.body.findAll("div", {"id": "bedeutungen"})
print(Bedeutungen)
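Once the div is found, you will probably want its text rather than the raw tag list. Here is a minimal sketch of that step. The helper name extract_meanings and the sample markup are made up for illustration, and I am assuming the meanings sit in <li> elements inside the div, which may not match the real Duden page exactly.

```python
from bs4 import BeautifulSoup


def extract_meanings(html):
    """Return the text of each <li> inside the div with id="bedeutungen".

    Hypothetical helper: assumes the meanings are list items in that div.
    """
    soup = BeautifulSoup(html, "html.parser")
    # An id should be unique on a page, so find() (first match) is enough.
    section = soup.find("div", id="bedeutungen")
    if section is None:
        # The div is missing, e.g. the saved file differs from the live page.
        return []
    # get_text(" ", strip=True) strips each text piece and joins with a space.
    return [li.get_text(" ", strip=True) for li in section.find_all("li")]


# Made-up stand-in for the real page, so the sketch runs offline.
sample = (
    '<div id="bedeutungen"><ol>'
    "<li>öffnen</li>"
    "<li> ein Geschäft <b>eröffnen</b></li>"
    "</ol></div>"
)
meanings = extract_meanings(sample)
print(meanings)
```

With the live page you would pass req.text instead of sample. If the list items are nested on the real page, find_all("li") returns the inner ones as well, so the selection may need narrowing.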
Your first call to .findAll() didn't work because the text kwarg matches against the text inside tags, not against the tag markup itself. The following also works, but there's no particular reason to use it over the version shown above.
tags = soup.body.findAll("div", class_="division", id="bedeutungen")
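To make the difference concrete, here is a small sketch on a made-up HTML snippet (not the real Duden markup). It uses find_all with the string argument, which is the newer name for the text kwarg in recent Beautiful Soup versions, and compares it with a search by tag name and id.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the saved page; the real markup is much larger.
html = """
<html><body>
<div class="division" id="bedeutungen">
  <h2>Bedeutungen</h2>
  <ol><li>öffnen</li><li>eröffnen</li></ol>
</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# string= (older name: text=) is compared against text nodes only,
# so a piece of markup can never match.
by_markup = soup.body.find_all(string='<div id="bedeutungen">')

# The same argument does match actual text content inside a tag.
by_text = soup.body.find_all(string="Bedeutungen")

# Searching for the tag itself: match on tag name and id attribute.
by_tag = soup.body.find_all("div", id="bedeutungen")

print(by_markup)
print(by_text)
print(len(by_tag))
```

The first search comes back empty because no text node on the page consists of that markup, while the tag search finds the one div.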