How to remove <br> tag but keep everything within the same paragraph

Time:02-14

It's my first time posting so hopefully I'm able to make this as clear as possible.

For an assignment I have to use BeautifulSoup to crawl a made-up webpage and extract all the titles and abstracts from each publication page. In general I've been able to do this by finding the abstract paragraph on each page and appending it to an empty list. However, one of the pages has its abstract broken into several small chunks separated by <br> tags.

This is annoying because instead of being treated as one abstract, it is treated as 5 different ones, so it shifts all the following publications and the titles and abstracts no longer match up.

I've tried extracting the <br> tags with the following code:

#text from abstract
abstracttext = []

for url in final_list:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
#    print(soup.prettify())

    necessarytext = soup.find("p")
    
    for e in soup.find_all('br'):
        e.extract()

#    print(necessarytext)
    
    for text in necessarytext:
        abstracttext.append(text)

#print(abstracttext)

If I look at 'necessarytext' now, the problem appears to be solved, as all the sentences are within the same paragraph. However, once I append everything to the empty list, the sentences are separated again as if they were different paragraphs, which throws everything off.

Does anybody have any ideas why this might be? Is there any way of removing the <br> tags while ensuring everything stays within the same paragraph, or is there a general-purpose way of concatenating all these sentences together? Sorry if I've been a bit unclear, and I appreciate any help you can send my way.

EDIT: The 'url' in the code comes from an earlier web-scraping step. The publications are grouped by topic, so I was able to go through each topic and extract the publication pages from there. All the unique URLs were added to a list called 'final_list', and this for loop is supposed to go through each of these publication pages to extract the abstract. Hope that is clearer.

Answer:

To remove the <br>s from the <p> you could use extract() or decompose():

...
necessarytext = soup.find("p")

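# find_all() returns a list, so removing tags inside the loop is safe
# (removing children while iterating over the tag itself can skip
# consecutive <br> tags)
for x in necessarytext.find_all('br'):
    x.extract()
    ##or 
    ##x.decompose()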

abstracttext.append(necessarytext)
...

Note: in case it was not clear, if you do not need the <p> tag itself, just call abstracttext.append(soup.find("p").text). This gives you the plain text of the <p> without the <br/> tags.
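
The plain-text route from the note above can be sketched as follows. This is only a minimal illustration: the sample markup and the name 'paragraph' are made up for the demo, and get_text() is used here instead of .text so that a separator can be passed, which keeps the chunks from running together where a <br/> used to be.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one publication page.
html = "<p>First sentence.<br/>Second sentence.<br/>Third sentence.</p>"
soup = BeautifulSoup(html, "html.parser")

abstracttext = []
paragraph = soup.find("p")

# get_text() joins every text node inside the <p>. The first argument is
# placed between the nodes, and strip=True trims whitespace around each one.
abstract = paragraph.get_text(" ", strip=True)

# Append the whole abstract as ONE string, not piece by piece.
abstracttext.append(abstract)

print(abstracttext)
# ['First sentence. Second sentence. Third sentence.']
```

With this approach there is no need to remove the <br/> tags first, since they contribute no text of their own.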

Example

from bs4 import BeautifulSoup

abstracttext = []

html='''<p>a <br/> b <br/> c</p>'''
soup = BeautifulSoup(html, "html.parser")
necessarytext = soup.find("p")

# find_all() returns a list, so it is safe to remove tags while looping
for x in necessarytext.find_all('br'):
    x.decompose()

# append the <p> as a whole, not its children one by one
abstracttext.append(necessarytext)

print(abstracttext)

Output

[<p>a  b  c</p>]
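
As for why the loop in the question split the abstract again: iterating over a tag yields its children one at a time, and removing the <br/> tags does not merge the text pieces they used to separate. The sketch below reuses the sample markup from the example above to show this; the names 'paragraph', 'pieces' and 'joined' are made up for the illustration.

```python
from bs4 import BeautifulSoup

html = "<p>a <br/> b <br/> c</p>"
soup = BeautifulSoup(html, "html.parser")
paragraph = soup.find("p")

# Remove the <br/> tags; find_all() returns a list, so this is safe.
for br in paragraph.find_all("br"):
    br.extract()

# Iterating over the tag walks its children. The three text nodes are
# still separate, so a loop that appends each child gets three entries.
pieces = [str(child) for child in paragraph]
print(len(pieces))  # 3

# Joining the pieces (or calling get_text()) yields a single string.
joined = "".join(pieces)
print(joined == paragraph.get_text())  # True
```

So the printed paragraph looks like one block, but appending inside "for text in necessarytext" still adds one list entry per chunk. Appending the tag itself, or its text, adds a single entry.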