Home > OS >  Scrape Text and save File with Bold Text Intact?
Scrape Text and save File with Bold Text Intact?

Time:02-13

I am very new to Python and webscraping. I have tried to search for an answer, but cannot find it. It might be because I don't know the terminology to ask the right question.

I am trying to web scrape using python - beautiful soup in order to extract the English transliterations of verb tables from a website (https://www.pealim.com/dict/28-lavo/) that conjugates modern Hebrew verbs. I am then trying to save the text to a txt file. The sticking point is I am trying to get the bold formatting tag to remain intact during the scraping/saving to file, because they are important to know where the stress falls in the word.

Here is an example of what I am getting: ba'im

And here is what I would like: ba'im

I'm including an image because when I post the HTML code, it's automatically rendering it:

What I'm looking to do

By looking around the forums, I have come up with code gets me close to what I need, but I cannot figure out how to get the bold tags in there as well.

import requests
from bs4 import BeautifulSoup as bs

#load webpage content
r = requests.get("https://www.pealim.com/dict/28-lavo/")

#Convert to a soup object
soup = bs(r.content)

#Find the transliterations from the verb tables with the stress bolded
mine = [element.text for element in soup.find_all("div", "transcription")]

#Save to file
with open("lavo.txt", "w") as output:
    for i in mine:
        output.write('%s\n' % i)

CodePudding user response:

You can use .contents property, cast it to string and join it. For example:

import requests
from bs4 import BeautifulSoup as bs

# load webpage content
r = requests.get("https://www.pealim.com/dict/28-lavo/")

# Convert to a soup object
soup = bs(r.content, "html.parser")

# Find the transliterations from the verb tables with the stress bolded
mine = [
    "".join(map(str, element.contents))
    for element in soup.find_all("div", "transcription")
]

with open("lavo.txt", "w") as output:
    for i in mine:
        output.write("%s\n" % i)

Saves lavo.txt:

b<b>a</b>
ba'<b>a</b>
ba'<b>i</b>m
ba'<b>o</b>t
b<b>a</b>ti
b<b>a</b>nu
b<b>a</b>ta
b<b>a</b>t
bat<b>e</b>m

...
  • Related