How to Extract a Division from HTML with BeautifulSoup

Time:06-24

I am trying to extract the 'meanings' section of a dictionary entry from an HTML file using BeautifulSoup, but it is giving me some trouble. Here is a summary of what I have tried so far:

  • I right-click the dictionary entry page below and save the webpage to my Python directory as 'aufmachen.html'

https://www.duden.de/rechtschreibung/aufmachen

  • Within the source code of this webpage, the section that I am trying to extract starts from line 1042 with the expression
  • I wrote the code below but neither tags nor Bedeutungen contains any search results.
import requests
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

with open("aufmachen.html",encoding="utf8") as f:
    doc = BeautifulSoup(f,"html.parser")


tags = doc.body.findAll(text = '<div   id="bedeutungen">')

print(tags)

Bedeutungen = doc.body.findAll("div", {"id": "bedeutungen"})

print(Bedeutungen)

Could you please help me with this problem?

Thanks for your time in advance.

CodePudding user response:

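One thing to try is passing BS a string instead of the file object: call .read() on your file to get a string. (BeautifulSoup also accepts an open file handle, so if this changes nothing, check that the saved file actually contains the div.)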

with open("aufmachen.html", "r",encoding="utf8") as f:
    doc = BeautifulSoup(f.read(),"html.parser")
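
To check whether .read() makes a difference for a given file, the two forms can be compared directly. This is only a minimal sketch: the markup in SAMPLE is made up for illustration (it is not Duden's real HTML), and parse_both_ways is a hypothetical helper name.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Made-up markup standing in for the saved page (an assumption, not the real HTML).
SAMPLE = (
    '<html><body><div class="division" id="bedeutungen">'
    "<p>öffnen</p></div></body></html>"
)


def parse_both_ways(path):
    """Parse the same file once from a file handle and once from a string."""
    with open(path, encoding="utf8") as f:
        from_handle = BeautifulSoup(f, "html.parser")
    with open(path, encoding="utf8") as f:
        from_string = BeautifulSoup(f.read(), "html.parser")
    return from_handle, from_string


# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", encoding="utf8", delete=False
) as tmp:
    tmp.write(SAMPLE)

from_handle, from_string = parse_both_ways(tmp.name)
os.remove(tmp.name)

# Compare the serialized trees and look up the div in each.
print(str(from_handle) == str(from_string))
print(from_handle.find(id="bedeutungen"))
print(from_string.find(id="bedeutungen"))
```

If both lookups print None for your own file, the saved HTML probably does not contain the div at all, whichever way it is passed in.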

However, it seems you want to pull the HTML from a URL rather than from a file on your computer. This can be done like this:

from bs4 import BeautifulSoup
import requests

url = "https://www.duden.de/rechtschreibung/aufmachen"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

Bedeutungen = soup.body.findAll("div", {"id": "bedeutungen"})

print(Bedeutungen)
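
Once the div is found, you will probably want its text rather than the raw tag list. Here is a rough sketch of that step, run on a small made-up snippet instead of the live page so that it works offline. SAMPLE_HTML and extract_section are illustrative names, and the real page's markup may well be structured differently.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup (an assumption); the real page is far larger.
SAMPLE_HTML = """
<html><body>
  <div class="division" id="rechtschreibung"><p>auf|ma|chen</p></div>
  <div class="division" id="bedeutungen">
    <h2>Bedeutungen</h2>
    <ol><li>öffnen</li><li>eröffnen</li></ol>
  </div>
</body></html>
"""


def extract_section(html, section_id):
    """Return the element with the given id, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(id=section_id)


section = extract_section(SAMPLE_HTML, "bedeutungen")
if section is None:
    print("Section not found; the HTML may differ from what the browser shows.")
else:
    # get_text() flattens the nested tags into plain text.
    print(section.get_text(" ", strip=True))
    # Collect the list entries inside the section individually.
    items = [li.get_text(strip=True) for li in section.find_all("li")]
    print(items)
```

Using find(id=...) returns a single element (or None), which is usually more convenient than a list when an id is expected to be unique on the page.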

Your first call to .findAll() didn't work because the text kwarg matches text nodes inside tags, not the tags themselves, so a string of markup never matches. The following also works, but there's no particular reason to prefer it over the call shown above.

tags = soup.body.findAll("div", class_="division", id="bedeutungen")
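
The difference between searching for text and searching for a tag can be seen on a tiny example. This is a sketch on a made-up one-line document. It uses find_all and the string argument, which are the current spellings of findAll and text in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

# Made-up one-line document for illustration only.
html = '<body><div class="division" id="bedeutungen">Bedeutungen</div></body>'
soup = BeautifulSoup(html, "html.parser")

# Searching for markup as text finds nothing: tags are not text nodes.
as_markup = soup.find_all(string='<div id="bedeutungen">')

# Searching for the visible text returns the string itself, not the <div>.
as_text = soup.find_all(string="Bedeutungen")

# Searching by tag name and attribute returns the element.
as_tag = soup.find_all("div", id="bedeutungen")

print(as_markup)
print(as_text)
print(as_tag)
```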