How to wrap each sentence within a tag with its custom mark colour?

I am using Beautiful Soup and requests to load the HTML of a website (e.g. https://en.wikipedia.org/wiki/Elephant). I want to mimic this page, but colourise the sentences within the 'p' tags (paragraphs).

For this I am using spaCy to break the text down into sentences. For each sentence I select a colour (a probability colour based on a binary deep learning classifier, for those interested).

def get_colorized_p(p):
    # p is the Beautiful Soup <p> tag
    doc = nlp(p.text)
    string = '<p>'
    for sentence in doc.sents:
        # The prediction is a value between 0 and 1.
        prediction = classify(sentence.text, model=model, pred_values=True)[1][1].numpy()
        # A custom function maps the prediction to a hex colour.
        color = get_hexcolor(prediction)
        string += f'<mark style="background: {color};">{sentence.text} </mark> '
    string += '</p>'
    return string  # a new long string with the markup
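
For readers who want to run the function above, here is a minimal sketch of what a helper like get_hexcolor could look like. The real get_hexcolor is not shown in the question, so the two endpoint colours, the linear interpolation, and the wrap_sentence helper are all assumptions made for illustration.

```python
from html import escape


def get_hexcolor(prediction, low=(0xED, 0xF8, 0xFB), high=(0x00, 0x6D, 0x2C)):
    """Hypothetical stand-in: blend two RGB colours for a prediction in [0, 1]."""
    # Clamp so that out-of-range predictions still give a valid colour.
    t = min(max(float(prediction), 0.0), 1.0)
    channels = (round(lo + (hi - lo) * t) for lo, hi in zip(low, high))
    return '#' + ''.join(f'{c:02x}' for c in channels)


def wrap_sentence(sentence, prediction):
    """Hypothetical helper: wrap one sentence in a coloured <mark> element."""
    color = get_hexcolor(prediction)
    # escape() keeps characters such as < and & from being read as markup.
    return f'<mark style="background: {color};">{escape(sentence)}</mark> '
```

Escaping the sentence text is optional for plain prose, but it avoids broken markup if a sentence happens to contain characters such as < or &.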

I create a new long string with HTML markup within p tags. I now want to replace the 'old' element in the beautiful soup object. I do this with a simple loop:

for element in tqdm_notebook(soup.findAll()):
    if element.name == 'p':
        if len(element.text.split()) > 2: 
            element = get_colorized_p(element)

This does not raise any errors, but when I render the HTML file it is displayed without the markup.

I am using jupyter to quickly display the HTML file:

from IPython.display import display, HTML

display(HTML(html_file))

However, this does not work. I did verify the string returned by get_colorized_p: when I use it on a single p element and render the result, it works fine. But I want to insert the string into the Beautiful Soup object.

I hope someone can shed some light on this problem. It goes wrong when the element is replaced within the loop, but I have no clue how to fix it.

An example of a returned string, just in case:

<p><mark style="background: #edf8fb;">Elephants are the largest existing land animals.</mark><mark style="background: #f1fafc;">Three living species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant.</mark><mark style="background: #f3fafc;">They are an informal grouping within the proboscidean family Elephantidae.</mark><mark style="background: #f3fafc;">Elephantidae is the only surviving family of proboscideans; extinct members include the mastodons.</mark><mark style="background: #eff9fb;">Elephantidae also contains several extinct groups, including the mammoths and straight-tusked elephants.</mark><mark style="background: #68c3a6;">African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs.</mark><mark style="background: #56ba91;">The distinctive features of all elephants include a long proboscis called a trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin.</mark><mark style="background: #d4efec;">The trunk is used for breathing, bringing food and water to the mouth, and grasping objects.</mark><mark style="background: #e7f6f9;">Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging.</mark><mark style="background: #d9f1f0;">The large ear flaps assist in maintaining a constant body temperature as well as in communication.</mark><mark style="background: #e5f5f9;">The pillar-like legs carry their great weight.</mark><mark style="background: #72c7ad;"> </mark></p>

CodePudding user response：

I like the idea and the colour scheme. The main issue, in my opinion, is that you try to replace the tag with a string, whereas you should call replace_with() with a bs4 object, so that the soup itself gets its new flavour:

for element in tqdm_notebook(soup.find_all()):
    if element.name == 'p':
        if len(element.text.split()) > 2: 
            element.replace_with(BeautifulSoup(get_colorized_p(element), 'html.parser'))

Convert your soup back to a string and try to display it:

display(HTML(str(soup)))

In new code, avoid the old findAll() syntax and use find_all() instead. For more, take a minute to check the docs.

Example

from bs4 import BeautifulSoup
from IPython.display import display, HTML

html = '''
    <p>Elephants are the largest ...</p>
'''
soup = BeautifulSoup(html, 'html.parser')

def get_colorized_p(element):
    ### processing and returning of result str
    return '<p><mark style="background: #edf8fb;">Elephants are the largest existing land animals.</mark><mark style="background: #f1fafc;">Three living species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant.</mark><mark style="background: #f3fafc;">They are an informal grouping within the proboscidean family Elephantidae.</mark><mark style="background: #f3fafc;">Elephantidae is the only surviving family of proboscideans; extinct members include the mastodons.</mark><mark style="background: #eff9fb;">Elephantidae also contains several extinct groups, including the mammoths and straight-tusked elephants.</mark><mark style="background: #68c3a6;">African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs.</mark><mark style="background: #56ba91;">The distinctive features of all elephants include a long proboscis called a trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin.</mark><mark style="background: #d4efec;">The trunk is used for breathing, bringing food and water to the mouth, and grasping objects.</mark><mark style="background: #e7f6f9;">Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging.</mark><mark style="background: #d9f1f0;">The large ear flaps assist in maintaining a constant body temperature as well as in communication.</mark><mark style="background: #e5f5f9;">The pillar-like legs carry their great weight.</mark><mark style="background: #72c7ad;"> </mark></p>'

for element in soup.find_all():
    if element.name == 'p':
        if len(element.text.split()) > 2: 
            element.replace_with(BeautifulSoup(get_colorized_p(element), 'html.parser'))

display(HTML(str(soup)))
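
As a variation on the example above, the paragraph can also be rebuilt from tag objects instead of parsing a generated string. This is only a sketch: split_sentences and fake_color are hypothetical stand-ins for the spaCy sentence splitter and the classifier colour from the question, and the sample HTML is made up.

```python
import re

from bs4 import BeautifulSoup


def split_sentences(text):
    # Naive stand-in for spaCy's doc.sents: split after ., ! or ? plus whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def fake_color(sentence):
    # Hypothetical stand-in for classify() + get_hexcolor().
    return '#edf8fb' if len(sentence) < 40 else '#68c3a6'


def colorize_in_place(soup, p):
    sentences = split_sentences(p.get_text())
    p.clear()  # drop the old children, keep the <p> tag and its attributes
    for sentence in sentences:
        mark = soup.new_tag('mark', style=f'background: {fake_color(sentence)};')
        mark.string = sentence  # text is escaped when the soup is serialised
        p.append(mark)
        p.append(' ')


html = (
    '<div><p class="lead">Elephants are large. They have trunks and tusks. '
    'The pillar-like legs of an elephant carry their great weight.</p>'
    '<p>Too short.</p></div>'
)
soup = BeautifulSoup(html, 'html.parser')

for p in soup.find_all('p'):
    if len(p.get_text().split()) > 2:
        colorize_in_place(soup, p)

result = str(soup)  # pass this to display(HTML(result)) in a notebook
```

Because the original p tag stays in the tree, its attributes (such as class) survive, which may matter when mimicking a page whose styling depends on them.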

Not exactly the same but very close to the behavior in your question: How to wrap initial of each word in a specific tag with a <b>?

CodePudding user response：

element = get_colorized_p(element) only rebinds the local loop variable, which is never used again and is overwritten on the next iteration of the for loop; the soup itself is not changed. You need to save the processed elements, e.g. by concatenating them into one string.

html = ''
for element in tqdm_notebook(soup.find_all()):
    if element.name == 'p' and len(element.text.split()) > 2:
        html += get_colorized_p(element)
    else:
        # Note: find_all() also yields nested elements, so text can be appended more than once.
        html += element.text

display(HTML(html))
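
The rebinding behaviour described in this answer is plain Python and can be seen without Beautiful Soup. The following is a minimal sketch in which a list of strings stands in for the soup and upper() stands in for get_colorized_p; both are assumptions made for illustration.

```python
paragraphs = ['<p>one</p>', '<p>two</p>']

# Assigning to the loop variable only rebinds the name; the list is untouched.
for paragraph in paragraphs:
    paragraph = paragraph.upper()

unchanged = list(paragraphs)

# Saving each processed value somewhere is what keeps the result.
collected = ''
for paragraph in paragraphs:
    collected += paragraph.upper()
```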