How to remove all HTML tags except italics and keep newline information using BeautifulSoup?

I have the following data, from which I want to strip all HTML elements except italics (<i>) while keeping the newline (\n) information, using BeautifulSoup. I've tried several approaches without success. The original data:

data = '<div><div><i><font color=""#ff086c"">Tip: amplitude is decreased in axonal neuropathies; CV&nbsp;</font></i><i><font color=""#ff086c"">&amp; latency are</font></i><i><font color=""#ff086c"">&nbsp;prolonged in&nbsp;demyelination</font></i></div></div><div><i><font color=""#ff086c""><br></font></i></div><font color=""#ff086c""><i>1. Onset latency:</i>&nbsp;</font>is the time required for an electrical stimulus to initiate an evoked potential. This reflects the conduction along the fastest fibers.&nbsp;<div><i>- Prolonged in demyelination.</i><br><div><div><br></div><div><font color=""#ff086c""><i>2. Peak latency:</i>&nbsp;</font>represents the latency along the majority of the axons and is measured at the peak of the waveform amplitude.&nbsp;</div><div><i>- Prolonged in demyelination.</i>'

I attempted to get all text alone, but I lost the italics information:

print(BeautifulSoup(data, features='html.parser').get_text('\n'))

I attempted to split the data using (<div>) tags, but this leads to random lines being duplicated:

for element in [i.get_text().replace('\xa0', ' ').lstrip().rstrip() for i in BeautifulSoup(data, features='html.parser').find_all('div')]: print(element)

The data should eventually look like this:

Tip: amplitude is decreased in axonal neuropathies; CV & latency are prolonged in demyelination

1. Onset latency: is the time required for an electrical stimulus to initiate an evoked potential. This reflects the conduction along the fastest fibers. - Prolonged in demyelination.

2. Peak latency: represents the latency along the majority of the axons and is measured at the peak of the waveform amplitude. - Prolonged in demyelination.

Where am I going wrong?

CodePudding user response：

The main issue is that plain text does not support formatting, so if you want to keep something formatted, it needs some markup.

"strip all html elements except italics (<i>) and newline (\n)"

You could use unwrap() from BeautifulSoup, but be aware that the data contains no line breaks like \n, only <br> tags:

for e in soup.find_all():
    if e.name not in ['br','i']:
        e.unwrap()
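
As a minimal, self-contained sketch of what unwrap() does here: the short `snippet` string below is a made-up example (not the data from the question), and the explicit 'html.parser' argument is an assumption about which parser to use. Every tag other than <i> and <br> is replaced by its own children, so the text and the italics markup stay in place.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened input for illustration only
snippet = '<div><font color="#ff086c"><i>Onset latency:</i></font> text<br></div>'
soup = BeautifulSoup(snippet, 'html.parser')

# find_all() returns a list, so modifying the tree inside the loop is safe
for tag in soup.find_all():
    if tag.name not in ('i', 'br'):
        # Remove the tag itself but keep everything inside it
        tag.unwrap()

result = str(soup)
print(result)
```

Note that the <br> is still a tag at this point and is serialized as <br/>; turning it into \n is a separate step.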

Example

I think the closest result you can get without any regex is something like this. Keep in mind that the <i> markup stays in place.

from bs4 import BeautifulSoup

html='''
<div><div><i><font color=""#ff086c"">Tip: amplitude is decreased in axonal neuropathies; CV&nbsp;</font></i><i><font color=""#ff086c"">&amp; latency are</font></i><i><font color=""#ff086c"">&nbsp;prolonged in&nbsp;demyelination</font></i></div></div><div><i><font color=""#ff086c""><br></font></i></div><font color=""#ff086c""><i>1. Onset latency:</i>&nbsp;</font>is the time required for an electrical stimulus to initiate an evoked potential. This reflects the conduction along the fastest fibers.&nbsp;<div><i>- Prolonged in demyelination.</i><br><div><div><br></div><div><font color=""#ff086c""><i>2. Peak latency:</i>&nbsp;</font>represents the latency along the majority of the axons and is measured at the peak of the waveform amplitude.&nbsp;</div><div><i>- Prolonged in demyelination.</i>
'''
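# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')

for e in soup.find_all():
    # Mark the end of every <div> with a <br> so the block boundary survives unwrapping
    if e.name == 'div':
        e.append(soup.new_tag('br'))

    if e.name not in ['i','br']:
        if len(e.get_text(strip=True)) == 0:
            # Drop tags that contain no text at all
            e.extract()
        else:
            # Keep the children, remove only the tag itself
            e.unwrap()

# BeautifulSoup serializes <br> as <br/>, so turn those into newlines
print(str(soup).replace('<br/>','\n'))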

Output

<i>Tip: amplitude is decreased in axonal neuropathies; CV </i><i>&amp; latency are</i><i> prolonged in demyelination</i>

<i>1. Onset latency:</i> is the time required for an electrical stimulus to initiate an evoked potential. This reflects the conduction along the fastest fibers. <i>- Prolonged in demyelination.</i>
<i>2. Peak latency:</i> represents the latency along the majority of the axons and is measured at the peak of the waveform amplitude. 
<i>- Prolonged in demyelination.</i>
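
If the leftover non-breaking spaces and runs of empty lines are a problem, the approach above could be wrapped in a small helper with some cleanup afterwards. This is only a sketch: the function name `html_to_italic_text`, the `sample` string, and the cleanup rules are assumptions for illustration, and unlike the example above it does use a little regex for the whitespace handling.

```python
import re
from bs4 import BeautifulSoup

def html_to_italic_text(html):
    """Hypothetical helper: keep <i> tags, turn <br>/<div> boundaries into newlines."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all():
        if tag.name == 'div':
            # Mark the end of each block so it survives unwrapping
            tag.append(soup.new_tag('br'))
        if tag.name not in ('i', 'br'):
            if not tag.get_text(strip=True):
                tag.extract()   # tag holds no text: drop it entirely
            else:
                tag.unwrap()    # keep the children, remove the tag itself
    text = str(soup).replace('<br/>', '\n').replace('\xa0', ' ')
    text = re.sub(r'[ \t]*\n[ \t]*', '\n', text)  # trim spaces around line breaks
    text = re.sub(r'\n{3,}', '\n\n', text)        # allow at most one empty line in a row
    return text.strip()

# Made-up sample input, much shorter than the data in the question
sample = ('<div><i>Tip:&nbsp;</i><i>&amp; more</i></div>'
          '<div><br></div>'
          '<div><i>1. Onset latency:</i>&nbsp;is the time.</div>')
cleaned = html_to_italic_text(sample)
print(cleaned)
```

Because the result is still serialized markup, characters such as & remain escaped as &amp;, just as in the output above.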