How to scrape text from HTML excluding tags, that contains specific pattern?

Time:09-02

I am trying to extract the information from this page:

https://www.philips.com/a-w/security/security-advisories

I want each article in every category to be assigned to its own row of a dataframe.

For example: first article --> first row of the dataframe, second article --> second row of the dataframe, and so on.

To start, I am trying the following code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from requests.models import ContentDecodingError

url = "https://www.philips.com/a-w/security/security-advisories"
# get the html content of the url
html_content = requests.get(url).text
# parse the html content
soup = BeautifulSoup(html_content, 'html.parser')
span_content = []
for span in soup.find_all("span", class_="p-body-copy-02"):
    span_content.append(span.text)

The resulting span_content has the following form, where each article's text follows its publication date and update date entries:

['Publication Date:\xa02022 August 25',
 'Update Date: 2022 August 25',
 'Philips is currently monitoring... specific to their Philips’ products.',
 'Publication Date:\xa02022 August 18',
 'Update Date: 2022 August 18',
 'Philips is currently monitoring...specific to their Philips’ products.',

etc]

I am trying the following code to get rid of the publication date and update date:

def delete_date(span_content):
    for i in range(len(span_content)):
        if span_content[i] == 'Publication Date:' or span_content[i] == 'Update Date:':
            span_content.pop(i)
            break
    return span_content
delete_date(span_content)

However, this is not working.

So how do I get rid of the publication date and update date entries and put the remaining information into a dataframe like this?

index  Info
0      Philips is currently monitoring... specific to their Philips’ products
1      Philips is currently monitoring...specific to their Philips’ products
...    ...
etc    etc

CodePudding user response:

You could skip the date entries with an if/continue statement, or use simple (but not robust) list slicing:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.philips.com/a-w/security/security-advisories"
# get the html content of the url
html_content = requests.get(url).text
# parse the html content
soup = BeautifulSoup(html_content, 'html.parser')

span_content = []
for span in soup.find_all("span", class_="p-body-copy-02"):
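    # Test the tag's text: `'...' in span` only looks for an exact match among the tag's direct children
    if 'Publication Date:' in span.text or 'Update Date:' in span.text:
        continue
    span_content.append(span.text)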

print(span_content)



# List slicing alternative (not robust: [2:] only skips the first two date entries)
# span_content = []
# for span in soup.find_all("span", class_="p-body-copy-02")[2:]:
#     span_content.append(span.text)

# print(span_content)

CodePudding user response:

Alternatively, you could select your elements more specifically with a CSS selector:

soup.select('span.p-body-copy-02:not(:-soup-contains(" Date:"))')

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.philips.com/a-w/security/security-advisories"
html_content = requests.get(url).text

soup = BeautifulSoup(html_content, 'html.parser')

data = []

for e in soup.select('dd'):
    # keep only the spans of this <dd> whose text does not contain " Date:"
    spans = e.select('span.p-body-copy-02:not(:-soup-contains(" Date:"))')
    data.append({
        'title': e.find_previous_sibling('dt').text,
        'content': ' '.join(span.get_text(' ', strip=True) for span in spans)
    })

pd.DataFrame(data)

Output

title content
0 Realtek Advisory (CVE-2022-27255) - (2022 August 25) Philips is currently monitoring developments and updates related to the Realtek AP-Router SDK Advisory (CVE-2022-27255). Realtek has confirmed that their eCos SDK-based routers, the ‘SIP ALG’ module is vulnerable to buffer overflow. Successful execution of this vulnerability could allow a crash or achieve the remote execution code. Realtek has released ... information specific to their Philips’ products.
1 Cisco Advisory (CVE-2022-20866) - (2022 August 18) Philips is currently monitoring developments and updates related to the recently released Cisco advisory . Cisco has confirmed a critical vulnerability (CVE-2022-20866) exists in the handling of RSA keys on devices running Adaptive Security Appliance (ASA) Software and Firepower ... information specific to their Philips’ products.

...
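
To try the selector without hitting the live site, here is a minimal offline sketch. The HTML fragment is hypothetical and heavily simplified (the real page's markup may differ), and it assumes a soupsieve version that supports `:-soup-contains`:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page structure may differ.
html = """
<dl class="p-accordion">
  <dt>Advisory A</dt>
  <dd>
    <span class="p-body-copy-02">Publication Date:&nbsp;2022 August 25</span>
    <span class="p-body-copy-02">Update Date: 2022 August 25</span>
    <span class="p-body-copy-02">First advisory text.</span>
  </dd>
  <dt>Advisory B</dt>
  <dd>
    <span class="p-body-copy-02">Publication Date:&nbsp;2022 August 18</span>
    <span class="p-body-copy-02">Update Date: 2022 August 18</span>
    <span class="p-body-copy-02">Second advisory text.</span>
  </dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

# Match the spans whose text does NOT contain " Date:".
selector = 'span.p-body-copy-02:not(:-soup-contains(" Date:"))'

rows = [
    {
        "title": dd.find_previous_sibling("dt").get_text(strip=True),
        "content": " ".join(s.get_text(" ", strip=True) for s in dd.select(selector)),
    }
    for dd in soup.select("dd")
]
print(rows)
```

Each dictionary in `rows` becomes one dataframe row when the list is passed to `pd.DataFrame(rows)`.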

CodePudding user response:

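I think your problem is the if statement: == only matches when the whole string is exactly 'Publication Date:' or 'Update Date:', but each entry also contains the date itself. In addition, the break stops the loop after the first removal. Try this function, which tests for a substring and builds a new list: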

def delete_date(span_content):
   filtered_span_content = []
   for i in range(len(span_content)):
       if 'Publication Date:' in span_content[i] or 'Update Date:' in span_content[i]:
           continue
       else:
           filtered_span_content.append(span_content[i])
   return filtered_span_content
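
The same filter can be written more compactly. This is a minimal sketch, not the only way to do it: the names `DATE_PREFIXES` and `drop_date_entries` are made up for illustration, the sample list is a shortened stand-in for the real scraped data, and it assumes the date entries always start with one of the two labels:

```python
# Labels that mark an entry as a date line (assumed to appear at the start).
DATE_PREFIXES = ("Publication Date:", "Update Date:")


def drop_date_entries(entries):
    """Return a new list without the entries that start with a date label."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return [text for text in entries if not text.strip().startswith(DATE_PREFIXES)]


# Sample data shaped like the list in the question (texts shortened).
span_content = [
    "Publication Date:\xa02022 August 25",
    "Update Date: 2022 August 25",
    "Philips is currently monitoring the Realtek advisory.",
    "Publication Date:\xa02022 August 18",
    "Update Date: 2022 August 18",
    "Philips is currently monitoring the Cisco advisory.",
]

info = drop_date_entries(span_content)
print(info)
```

Building a new list avoids modifying span_content while iterating over it. The result can then go straight into a dataframe, for example with `pd.DataFrame({'Info': info})`.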

CodePudding user response:

The following should work on your setup:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'https://www.philips.com/a-w/security/security-advisories'

big_list = []
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
parent_div = soup.select_one('dl.p-accordion')
titles = parent_div.select('dt')
for t in titles:
    advisory = t.find_next_sibling('dd')
    # The date lines may sit in a <span>, <strong> or <p>; remove whichever is found
    for label in ("Publication Date", "Update Date"):
        date_tag = advisory.find(['span', 'strong', 'p'], string=re.compile(label))
        if date_tag is not None:
            date_tag.decompose()

    big_list.append((t.text, advisory.get_text(strip=True)))
    

df = pd.DataFrame(big_list, columns = ['Title', 'Description'])
print(df)

Result:

Title   Description
0   Realtek Advisory (CVE-2022-27255) - (2022 August 25)    Philips is currently monitoring developments and updates related to the Realtek AP-Router SDK Advisory (CVE-2022-27255). Realtek has confirmed that their eCos SDK-based routers, the ‘SIP ALG’ module is vulnerable to buffer overflow.Successful execution of this vulnerability could allow a crash or achieve the remote execution code. Realtek has released a patch that remediate this vulnerability.At this time, no Philips products are known to be impacted. In accordance with Philips’ Global Security Policy, Philips continues to analyze the matter, and further information will be posted on the Philips Product Security Advisory page as appropriate. Philips is committed to ensuring the safety, security, integrity, and regulatory compliance of our products to be deployed and to operate within Philips approved product specifications. Therefore, in accordance with Philips’s policy and regulatory requirements, all changes of configuration or software to Philips’ products (including operating system security updates and patches) may be implemented only in accordance with Philips’s product-specific, verified & validated, authorized, and communicated customer procedures or field actions. If a product does require operating system security updates, configuration changes, or other actions to be taken by our customer or by Philips Customer Services, product-specific service documentation will be produced by Philips’s product teams and made available to Philips service delivery platforms such as the Philips InCenter Customer Portal.Contract-entitled customers may use Philips InCenter and are encouraged to request Philips InCenter access and reference product-specific information posted. 
If customers still have questions, all customers (contract-entitled or otherwise) are encouraged to contact their local service support team or regional product service support as appropriate for up-to-date information specific to their Philips’ products.
1   Cisco Advisory (CVE-2022-20866) - (2022 August 18)  Philips is currently monitoring developments and updates related to the recently released Ciscoadvisory. Cisco has confirmed a critical vulnerability (CVE-2022-20866) exists in the handling of RSA keys on devices running Adaptive Security Appliance (ASA) Software and Firepower Threat Defense (FTD) Software.Successful execution of this vulnerability could allow an unauthenticated, remote attacker to retrieve an RSA private key. Cisco has released software updates that help remediate this vulnerability.At this time, no Philips products are known to be impacted. In accordance with Philips’ Global Security Policy, Philips continues to analyze the matter, and further information will be posted on the Philips Product Security Advisory page as appropriate.Philips is committed to ensuring the safety, security, integrity, and regulatory compliance of our products to be deployed and to operate within Philips approved product specifications. Therefore, in accordance with Philips’s policy and regulatory requirements, all changes of configuration or software to Philips’ products (including operating system security updates and patches) may be implemented only in accordance with Philips’s product-specific, verified & validated, authorized, and communicated customer procedures or field actions.If a product does require operating system security updates, configuration changes, or other actions to be taken by our customer or by Philips Customer Services, product-specific service documentation will be produced by Philips’s product teams and made available to Philips service delivery platforms such as the Philips InCenter Customer Portal.Contract-entitled customers may use Philips InCenter and are encouraged to request Philips InCenter access and reference product-specific information posted. 
If customers still have questions, all customers (contract-entitled or otherwise) are encouraged to contact their local service support team or regional product service support as appropriate for up-to-date information specific to their Philips’ products.
[...]
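
Note that some sentences in the result run together ("overflow.Successful"). That happens because get_text(strip=True) joins the text pieces with no separator. Here is a small sketch on a hypothetical fragment (not the real page markup) showing the difference when a space is passed as separator:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for one <dd> advisory body.
html = (
    "<dd><p>Vulnerable to buffer overflow.</p>"
    "<p>Successful execution could allow a crash.</p></dd>"
)
dd = BeautifulSoup(html, "html.parser").dd

# No separator: stripped text pieces are concatenated directly.
glued = dd.get_text(strip=True)

# With a separator: stripped text pieces are joined by a single space.
spaced = dd.get_text(" ", strip=True)

print(glued)
print(spaced)
```

Using `advisory.get_text(' ', strip=True)` in the loop above would therefore keep the sentences apart.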

HedgeHog's solution (the CSS selector answer above) is more elegant, though, and it should work once you install or update soupsieve, which provides the :-soup-contains selector. Documentation for soupsieve: https://facelessuser.github.io/soupsieve/

And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
