I am using Beautiful Soup and requests to load the HTML of a website (e.g. https://en.wikipedia.org/wiki/Elephant). I want to mimic this page, but colourise the sentences within the 'p' tags (paragraphs).
For this I am using spaCy to break the text down into sentences. For each sentence I select a colour (a probability-based colour from a binary deep learning classifier, for those interested).
def get_colorized_p(p):
    doc = nlp(p.text)  # p is the Beautiful Soup p tag
    string = '<p>'
    for sentence in doc.sents:
        # The prediction value is anywhere between 0 and 1.
        prediction = classify(sentence.text, model=model, pred_values=True)[1][1].numpy()
        # I am using a custom function to map the prediction to a hex colour.
        color = get_hexcolor(prediction)
        # Append (+=) each marked-up sentence instead of overwriting the string.
        string += f'<mark style="background: {color};">{sentence.text} </mark> '
    string += '</p>'
    return string  # I create a new long string with the markup
I create a new long string with HTML markup within p tags. I now want to replace the 'old' element in the beautiful soup object. I do this with a simple loop:
for element in tqdm_notebook(soup.findAll()):
    if element.name == 'p':
        if len(element.text.split()) > 2:
            element = get_colorized_p(element)
This does not give any errors; however, when I render the HTML file, it is displayed without the markup.
I am using jupyter to quickly display the HTML file:
from IPython.display import display, HTML
display(HTML(html_file))
However, this does not work. I did verify the string returned by get_colorized_p: when I use it on a single p element and render it, it works fine. But I want to insert the string into the Beautiful Soup object.
I hope someone can shed some light on this problem. It goes wrong at the replacement of the element within the loop, but I have no clue how to fix it.
An example of a rendered string, just in case:
<p><mark style="background: #edf8fb;">Elephants are the largest existing land animals.</mark><mark style="background: #f1fafc;">Three living species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant.</mark><mark style="background: #f3fafc;">They are an informal grouping within the proboscidean family Elephantidae.</mark><mark style="background: #f3fafc;">Elephantidae is the only surviving family of proboscideans; extinct members include the mastodons.</mark><mark style="background: #eff9fb;">Elephantidae also contains several extinct groups, including the mammoths and straight-tusked elephants.</mark><mark style="background: #68c3a6;">African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs.</mark><mark style="background: #56ba91;">The distinctive features of all elephants include a long proboscis called a trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin.</mark><mark style="background: #d4efec;">The trunk is used for breathing, bringing food and water to the mouth, and grasping objects.</mark><mark style="background: #e7f6f9;">Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging.</mark><mark style="background: #d9f1f0;">The large ear flaps assist in maintaining a constant body temperature as well as in communication.</mark><mark style="background: #e5f5f9;">The pillar-like legs carry their great weight.</mark><mark style="background: #72c7ad;"> </mark></p>
CodePudding user response:
I like the idea and the color scheme. The main issue, in my opinion, is that you try to replace the tag with a string, while you should use replace_with() with a bs4 object to spice up your soup and give it a new flavor:
for element in tqdm_notebook(soup.find_all()):
    if element.name == 'p':
        if len(element.text.split()) > 2:
            element.replace_with(BeautifulSoup(get_colorized_p(element), 'html.parser'))
Convert your soup back to a string and try to display it:
display(HTML(str(soup)))
In newer code, avoid the old findAll() syntax and use find_all() instead. For more, take a minute to check the docs.
Example
from bs4 import BeautifulSoup
from IPython.display import display, HTML
html = '''
<p>Elephants are the largest ...</p>
'''
soup = BeautifulSoup(html, 'html.parser')
def get_colorized_p(element):
    # processing and returning of the result str
    return '<p><mark style="background: #edf8fb;">Elephants are the largest existing land animals.</mark><mark style="background: #f1fafc;">Three living species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant.</mark><mark style="background: #f3fafc;">They are an informal grouping within the proboscidean family Elephantidae.</mark><mark style="background: #f3fafc;">Elephantidae is the only surviving family of proboscideans; extinct members include the mastodons.</mark><mark style="background: #eff9fb;">Elephantidae also contains several extinct groups, including the mammoths and straight-tusked elephants.</mark><mark style="background: #68c3a6;">African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs.</mark><mark style="background: #56ba91;">The distinctive features of all elephants include a long proboscis called a trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin.</mark><mark style="background: #d4efec;">The trunk is used for breathing, bringing food and water to the mouth, and grasping objects.</mark><mark style="background: #e7f6f9;">Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging.</mark><mark style="background: #d9f1f0;">The large ear flaps assist in maintaining a constant body temperature as well as in communication.</mark><mark style="background: #e5f5f9;">The pillar-like legs carry their great weight.</mark><mark style="background: #72c7ad;"> </mark></p>'
for element in soup.find_all():
    if element.name == 'p':
        if len(element.text.split()) > 2:
            element.replace_with(BeautifulSoup(get_colorized_p(element), 'html.parser'))
display(HTML(str(soup)))
Not exactly the same but very close to the behavior in your question: How to wrap initial of each word in a specific tag with a <b>?
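As a possible variation on the replace_with() approach, the marks can also be built as bs4 tags with new_tag() and appended to the existing p, so no HTML string has to be parsed again and Beautiful Soup takes care of escaping characters such as < and &. This is only a minimal sketch: split_sentences() and pick_color() are hypothetical stand-ins for spaCy's doc.sents and for classify() plus get_hexcolor(), which are not reproduced here. Like the original p.text approach, it drops inline tags (links, bold) inside the paragraph.

```python
from bs4 import BeautifulSoup


def split_sentences(text):
    # Hypothetical stand-in for spaCy's doc.sents: naive split on ". "
    parts = [s.strip() for s in text.split(". ") if s.strip()]
    return [s if s.endswith(".") else s + "." for s in parts]


def pick_color(sentence):
    # Hypothetical stand-in for classify() + get_hexcolor()
    return "#edf8fb" if len(sentence) % 2 == 0 else "#68c3a6"


def colorize_in_place(soup, p):
    # Read the text first, then empty the tag and refill it with <mark> children.
    sentences = split_sentences(p.get_text())
    p.clear()
    for sentence in sentences:
        mark = soup.new_tag("mark", style=f"background: {pick_color(sentence)};")
        mark.string = sentence
        p.append(mark)
        p.append(" ")  # keep a space between sentences


html = "<div><p>Elephants are large. They have trunks. Tusks are teeth.</p><p>Short one.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all("p") selects the paragraphs directly, so no name check is needed.
for p in soup.find_all("p"):
    if len(p.get_text().split()) > 2:
        colorize_in_place(soup, p)

result = str(soup)
```

In a notebook, result could then be passed to display(HTML(result)) as in the snippets above.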
CodePudding user response:
element = get_colorized_p(element) assigns a local variable that is never used again and is overwritten by the for-loop variable on the next iteration. You need to save the processed elements, e.g. by concatenating them into one string.
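html = ''
# Note: findAll() without arguments also yields nested tags,
# so the text of container tags is appended more than once here.
for element in tqdm_notebook(soup.findAll()):
    if element.name == 'p' and len(element.text.split()) > 2:
        html += get_colorized_p(element)  # append, do not overwrite
    else:
        html += element.text
display(HTML(html))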
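The rebinding effect described in this answer is not specific to Beautiful Soup. A minimal sketch with a plain list (the names items and unchanged are made up for illustration) shows that assigning to the loop variable leaves the container untouched, while writing back into the container changes it:

```python
items = ["<p>a</p>", "<p>b</p>"]

# Rebinding the loop variable only changes the local name, not the list.
for item in items:
    item = item.upper()
unchanged = list(items)  # still the original strings

# Writing back through the container is what actually changes it.
for i, item in enumerate(items):
    items[i] = item.upper()
```

The same reasoning applies to the soup: the tree only changes when one of its own methods, such as replace_with(), is called on the element.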