I am very new to Python and webscraping. I have tried to search for an answer, but cannot find it. It might be because I don't know the terminology to ask the right question.
I am trying to web scrape using python - beautiful soup in order to extract the English transliterations of verb tables from a website (https://www.pealim.com/dict/28-lavo/) that conjugates modern Hebrew verbs. I am then trying to save the text to a txt file. The sticking point is I am trying to get the bold formatting tag to remain intact during the scraping/saving to file, because they are important to know where the stress falls in the word.
Here is an example of what I am getting: ba'im
And here is what I would like: ba'im
I'm including an image because when I post the HTML code, it's automatically rendering it:
By looking around the forums, I have come up with code gets me close to what I need, but I cannot figure out how to get the bold tags in there as well.
import requests
from bs4 import BeautifulSoup as bs
#load webpage content
r = requests.get("https://www.pealim.com/dict/28-lavo/")
#Convert to a soup object
soup = bs(r.content)
#Find the transliterations from the verb tables with the stress bolded
mine = [element.text for element in soup.find_all("div", "transcription")]
#Save to file
with open("lavo.txt", "w") as output:
for i in mine:
output.write('%s\n' % i)
CodePudding user response:
You can use .contents
property, cast it to string and join it. For example:
import requests
from bs4 import BeautifulSoup as bs
# load webpage content
r = requests.get("https://www.pealim.com/dict/28-lavo/")
# Convert to a soup object
soup = bs(r.content, "html.parser")
# Find the transliterations from the verb tables with the stress bolded
mine = [
"".join(map(str, element.contents))
for element in soup.find_all("div", "transcription")
]
with open("lavo.txt", "w") as output:
for i in mine:
output.write("%s\n" % i)
Saves lavo.txt
:
b<b>a</b>
ba'<b>a</b>
ba'<b>i</b>m
ba'<b>o</b>t
b<b>a</b>ti
b<b>a</b>nu
b<b>a</b>ta
b<b>a</b>t
bat<b>e</b>m
...