Remove specific "span" tag while preserving html object

Time:04-16

I am scraping a website using BeautifulSoup and Python. The page has more than 100 span tags. I want to remove every pair of consecutive span tags in which the first span contains the text "READ MORE:" and the second span contains an arbitrary string.

<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>,
 <span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>,
 <span>READ MORE: </span>,
 <span>Long queues form at airports as one million Aussies set to fly this Easter</span>,
 <span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>,
 <span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>,
 <span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>,
 <span>READ MORE: </span>,
 <span>Four female backpackers killed in horror highway crash</span>,
 <span>The court also heard he had earned the title of a serial traffic offender.</span>,
 <span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>,
 <span>Watfa will serve at least two years and three months for manslaughter.</span>,
 <span>He will be eligible for parole in early 2024.</span>

For example, I want to remove the 4 tags below:

<span>READ MORE: </span>,
<span>Long queues form at airports as one million Aussies set to fly this Easter</span>
<span>READ MORE: </span>,
 <span>Four female backpackers killed in horror highway crash</span>

The output should be:

<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>,
 <span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>,
 <span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>,
 <span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>,
 <span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>,
 <span>The court also heard he had earned the title of a serial traffic offender.</span>,
 <span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>,
 <span>Watfa will serve at least two years and three months for manslaughter.</span>,
 <span>He will be eligible for parole in early 2024.</span>

I would be grateful if someone could help me with the logic in Python. Cheers.

CodePudding user response:

I don't know whether this will be helpful for you, but here is my attempt:

from bs4 import BeautifulSoup

html='''
<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>
 <span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>
 <span>READ MORE: </span>
 <span>Long queues form at airports as one million Aussies set to fly this Easter</span>
 <span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>,
 <span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>,
 <span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>
 <span>READ MORE: </span>
 <span>Four female backpackers killed in horror highway crash</span>
 <span>The court also heard he had earned the title of a serial traffic offender.</span>,
 <span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>
 <span>Watfa will serve at least two years and three months for manslaughter.</span>
 <span>He will be eligible for parole in early 2024.</span>
'''

soup=BeautifulSoup(html,'html.parser')

# Remove every "READ MORE:" marker span.
for span in soup.select('span:-soup-contains("READ MORE:")'):
    span.extract()

# Remove the teaser spans that followed the markers (matched by hard-coded text).
for span in soup.select('span:-soup-contains("Long queues form at airports")'):
    span.extract()
for span in soup.select('span:-soup-contains("Four female backpackers killed")'):
    span.extract()

print(soup)

Output:

<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>
<span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>


<span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>,
 <span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>     
<span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>


<span>The court also heard he had earned the title of a serial traffic offender.</span>   
<span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>
<span>Watfa will serve at least two years and three months for manslaughter.</span>       
<span>He will be eligible for parole in early 2024.</span>

CodePudding user response:

Assuming you are scraping the text of each article on a news site, you should change your strategy.

Clean the tree first by calling .decompose() on the elements you do not want to scrape:

for e in soup.select('span:-soup-contains("READ MORE")'):
    e.find_next('span').decompose()
    e.decompose()

Then select the body of the article and extract the text:

soup.select_one('.article__body-croppable').get_text(' ', strip=True)

This results in:

A driver has been jailed over the death of a baby boy who was sitting on his lap during a crash in Sydney's south-west . Two cars collided at low speed in Lurnea on February 25, 2019. The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa. Peter Watfa has been jailed for at least two years and three months. (9News) Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred. The baby boy suffered fatal injuries when the driver's airbag deployed. A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care". An 11-month-old boy died in the crash. (9News) The court also heard he had earned the title of a serial traffic offender. In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs. Watfa will serve at least two years and three months for manslaughter. He will be eligible for parole in early 2024.
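
Putting the two steps together, a minimal self-contained sketch could look like the following. The HTML here is a shortened, made-up stand-in for the real article (only the class name article__body-croppable is taken from the selector above), and the None check on find_next() is an extra safeguard that the snippet above does not have.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the article markup.
html = """
<div class="article__body-croppable">
  <span>First paragraph.</span>
  <span>READ MORE: </span>
  <span>Some teaser headline</span>
  <span>Second paragraph.</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: remove each "READ MORE" marker together with the span that follows it.
for marker in soup.select('span:-soup-contains("READ MORE")'):
    teaser = marker.find_next("span")
    if teaser is not None:  # the marker might be the last span on the page
        teaser.decompose()
    marker.decompose()

# Step 2: extract the remaining text of the article body.
text = soup.select_one(".article__body-croppable").get_text(" ", strip=True)
print(text)
```

Note that the :-soup-contains selector needs a reasonably recent soupsieve (it comes with current BeautifulSoup installs); on older versions the same selector was spelled :contains.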


You could also iterate over your ResultSet and build a new list containing only the valid <span> elements, but I think that is not the best option:

[x for i, x in enumerate(results) if 'READ MORE' not in x.text and (i == 0 or 'READ MORE' not in results[i-1].text)]
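
If the one-liner is hard to read, the same filtering idea can be written as an explicit loop. This is only a sketch: the helper name drop_read_more and the sample HTML are made up for illustration, and results is assumed to come from find_all('span').

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page has many more spans.
html = """
<span>First paragraph.</span>
<span>READ MORE: </span>
<span>Some teaser headline</span>
<span>Second paragraph.</span>
"""

results = BeautifulSoup(html, "html.parser").find_all("span")


def drop_read_more(spans, marker="READ MORE"):
    """Return the spans without each marker span and the span right after it."""
    kept = []
    skip_next = False
    for span in spans:
        if skip_next:
            # This span directly follows a marker, so drop it as well.
            skip_next = False
            continue
        if marker in span.get_text():
            skip_next = True
            continue
        kept.append(span)
    return kept


kept_texts = [span.get_text() for span in drop_read_more(results)]
print(kept_texts)
```

Unlike .decompose(), this leaves the parsed tree untouched and only builds a filtered list.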