Find a p tag with a partial string using BeautifulSoup, and extract the integer in the string

I am trying to build a loop that goes through a list of web links and, on each page, extracts the integer in the p tag that follows the p tag containing the words "AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON". The target pages follow similar patterns but are not identical: the AGGREGATE AMOUNT sentence is not always formatted the same way (sometimes all caps, sometimes not, and the spacing can differ). I included 3 examples below of the section I am trying to extract the integer value from.

Below are the steps I am trying to achieve:

1-Iterate through links. Some links:
https://www.sec.gov/Archives//edgar/data/29669/000119312521353247/d228949dsc13da.htm
https://www.sec.gov/Archives//edgar/data/1482018/000119312521353071/d258656dsc13d.htm
https://www.sec.gov/Archives//edgar/data/1590854/000119312521353064/d210475dsc13d.htm
https://www.sec.gov/Archives//edgar/data/1021944/000119312521353050/d243715dsc13g.htm
https://www.sec.gov/Archives//edgar/data/1609492/000119312521353003/d251372dsc13da.htm
...

2-Find the p tag that contains the partial words "AGGREGATE AMOUNT BENEFICIALLY", which can be lowercase, title case, or all caps

 Aggregate Amount Beneficially Owned by Each Reporting Person:

3-Once that p tag is found, keep looking through the following p tags until we reach the first one whose string (if it exists), stripped of its commas, is a non-empty integer. This can be the p tag that immediately follows, or the one after it, as in the examples below.

 10,927,100*

10,927,100 is the number I need. The commas need to be stripped (I know we can use the code below), but the string also needs to be stripped of any other non-digit characters: int(shares.replace(',', ''))

example1:

<table border="0" cellpadding="0" cellspacing="0" style="BORDER-COLLAPSE:COLLAPSE; font-family:Times New Roman; font-size:10pt" width="98%">
<tr>
<td width="3%"></td>
<td valign="bottom" width="1%"></td>
<td width="6%"></td>
<td valign="bottom" width="1%"></td>
<td></td>
<td valign="bottom" width="1%"></td>
<td width="88%"></td></tr>
<tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:10pt">
<td style="BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-left:8pt" valign="top">11.  </td>
<td style=" BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000" valign="bottom"> </td>
<td colspan="5" style="BORDER-TOP:1px solid #000000; BORDER-RIGHT:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-right:2pt" valign="top">
<p style="margin-top:0pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman">  Aggregate Amount Beneficially Owned by Each Reporting Person:</p> <p style="font-size:12pt; margin-top:0pt; margin-bottom:0pt"> </p>
<p style="margin-top:0pt; margin-bottom:1pt; font-size:10pt; font-family:Times New Roman">  10,927,100*</p></td></tr>
<tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:10pt">
<td style="BORDER-LEFT:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-left:8pt" valign="top">12.</td>
<td style=" BORDER-LEFT:1px solid #000000; BORDER-BOTTOM:1px solid #000000" valign="bottom"> </td>
<td colspan="5" style="BORDER-RIGHT:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-right:2pt" valign="top"> <p style="margin-top:0pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman">  Check if the Aggregate
Amount in Row (11) Excludes Certain Shares (See Instructions):</p> <p style="font-size:12pt;margin-top:0pt;margin-bottom:0pt"> </p>

example2:

<p style="font-size:12pt; margin-top:0pt; margin-bottom:0pt"> </p> <p style="margin-top:0pt; margin-bottom:1pt; font-size:12pt; font-family:Times New Roman">28,911,268(1)</p></td></tr>
<tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:12pt">
<td style="BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-left:8pt" valign="top"><font style="font-size:10pt">11</font></td>
<td style=" BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000" valign="bottom"> </td>
<td colspan="5" style="BORDER-TOP:1px solid #000000; BORDER-RIGHT:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-right:2pt" valign="top">
<p style="margin-top:0pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman">AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON</p> <p style="font-size:12pt; margin-top:0pt; margin-bottom:0pt"> </p>
<p style="margin-top:0pt; margin-bottom:1pt; font-size:12pt; font-family:Times New Roman">28,911,268</p></td></tr>
<tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:12pt">

example3:

<table border="0" cellpadding="0" cellspacing="0" style="BORDER-COLLAPSE:COLLAPSE; font-family:Times New Roman; font-size:10pt" width="100%">
<tr>
<td width="3%"></td>
<td valign="bottom" width="1%"></td>
<td width="6%"></td>
<td valign="bottom" width="1%"></td>
<td></td>
<td valign="bottom" width="1%"></td>
<td width="88%"></td></tr>
<tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:10pt">
<td style="BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-left:8pt" valign="top">11  </td>
<td style=" BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000" valign="bottom"> </td>
<td colspan="5" style="BORDER-TOP:1px solid #000000; BORDER-RIGHT:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-right:2pt" valign="top">
<p style="margin-top:0pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman">  AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON</p> <p style="font-size:12pt; margin-top:0pt; margin-bottom:0pt"> </p><p style="margin-top:0pt; margin-bottom:1pt; font-size:10pt; font-family:Times New Roman">  23,866,091</p></td></tr><tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:10pt">

For the time being, I am trying to collect all possible string forms of the sentence "AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON", instead of searching for a partial string that can apply to all.
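A partial match that covers all of these forms could be sketched as below. This is only an illustration: the helper name is_aggregate_label is hypothetical, and it assumes the label always contains the words "aggregate amount beneficially" in that order, with differences limited to case and whitespace (including non-breaking spaces).

```python
AGG = "AGGREGATE AMOUNT BENEFICIALLY"


def is_aggregate_label(text):
    """Return True if text contains the label, ignoring case and spacing."""
    # str.split() with no argument splits on any Unicode whitespace,
    # including the non-breaking spaces (\xa0) that &nbsp; turns into.
    normalized = " ".join(text.split()).upper()
    return AGG in normalized


print(is_aggregate_label("\xa0\xa0Aggregate Amount Beneficially Owned by Each Reporting Person:"))  # True
print(is_aggregate_label("9.  AGGREGATE  AMOUNT\nBENEFICIALLY OWNED BY EACH REPORTING PERSON"))  # True
print(is_aggregate_label("Check if the Aggregate Amount in Row (11) Excludes Certain Shares"))  # False
```

A check like this can be applied to the text of each p tag, so a single test replaces the list of exact strings.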

My current code so far:

import time

import requests
from bs4 import BeautifulSoup


def doGet(url):
    s = requests.Session()
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
        'Sec-CH-UA': 'Examplary Browser',
        'Sec-CH-UA-Mobile': '?0',
        'Sec-CH-UA-Platform': "Windows",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "authority": "www.sec.gov",
        "method": "GET",
        "path": "/Archives/edgar/data/59478/000120919121046268/0001209191-21-046268-index.htm",
        "scheme": "https",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip deflate br",
        "accept-language": "en-US,en;q=0.9,ar-LB;q=0.8,ar;q=0.7",
        "cache-control": "max-age=0"}


    r = None
    proxies = {
        'http': 'socks5://127.0.0.1:9050',
        'https': 'socks5://127.0.0.1:9050'
    }
    try:
        r = s.get(url, headers=headers)

        if r.status_code == 403:
            i = 1
            while r.status_code == 403 and i <= 10:
                print("Status code: " + str(r.status_code) + "  for : " + url)
                time.sleep(1)
                r = s.get(url, headers=headers, proxies=proxies)
                i += 1
        # print("Status code: " + str(r.status_code) + "  for : " + url)
        # r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print("Http Error: " + str(errh) + " for url: " + url)
        # r is still None if the very first request raised
        if r is not None and r.status_code == 403:
            i = 1
            while r.status_code == 403 and i <= 10:
                time.sleep(i)
                r = s.get(url, headers=headers)
                i += 1
    except requests.exceptions.ConnectionError as errc:
        print("Error Connecting: " + str(errc) + "  for : " + url)
    except requests.exceptions.Timeout as errt:
        print("Timeout Error: " + str(errt) + "  for : " + url)
    except requests.exceptions.RequestException as err:
        print("OOps: Something Else" + str(err) + "  for : " + url)

    return r


for link in link_list:
    res = doGet(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Finding a pattern(certain text)

    try:
        pattern = '9.  AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON' 
        # Anchor tag
        text1 = soup.find('p', text = pattern)
        # print(text1)
        shares = text1.find_next_sibling('p').string
        shares
    except Exception as e:
        print('test1' + str(e))
        pass

    try:
        pattern = "  AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON" 
        # Anchor tag
        text1 = soup.find('p', text = pattern)
        # print(text1)
        shares =text1.find_next_sibling('p').find_next_sibling('p').string
        print(shares)
    except Exception as e:
        print('test2' + str(e))
        pass

    try:
        pattern = "AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON" 
        # Anchor tag
        text1 = soup.find('p', text = pattern)
        # print(text1)
        shares =text1.find_next_sibling('p').find_next_sibling('p').string
        print(shares)
    except Exception as e:
        print('test3' + str(e))
        pass

    try:
        pattern = "  Aggregate Amount Beneficially Owned by Each Reporting Person:" 
        # Anchor tag
        text1 = soup.find('p', text = pattern)
        # print(text1)
        shares =text1.find_next_sibling('p').find_next_sibling('p').string
        print(shares)
    except Exception as e:
        print('test4' + str(e))
        pass

    try:
        pattern = '9.  AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON' 
        # Anchor tag
        text1 = soup.find('p', text = pattern)
        # print(text1)
        shares = text1.find_next_sibling('p').string
        shares
    except Exception as e:
        print('test1' + str(e))
        pass

CodePudding user response：

You should be able to adapt this. The pages you're visiting are very plain HTML (no JavaScript), so it's simple and efficient to process each page in a plain loop.

What happens here is that we look for a p tag with text matching "AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON" in a case-insensitive manner. If we find that, we set a flag. Then on subsequent iterations of the p tags found by BeautifulSoup, we look for text that is a) not empty and b) contains a sequence of digits (ignoring commas).

And so on...

import requests
from bs4 import BeautifulSoup as BS
import re

URLS = ['https://www.sec.gov/Archives//edgar/data/29669/000119312521353247/d228949dsc13da.htm',
        'https://www.sec.gov/Archives//edgar/data/1482018/000119312521353071/d258656dsc13d.htm',
        'https://www.sec.gov/Archives//edgar/data/1590854/000119312521353064/d210475dsc13d.htm',
        'https://www.sec.gov/Archives//edgar/data/1021944/000119312521353050/d243715dsc13g.htm',
        'https://www.sec.gov/Archives//edgar/data/1609492/000119312521353003/d251372dsc13da.htm']


HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    "Accept-Language": "en-US,en;q=0.5"
}

AGG = 'AGGREGATE AMOUNT BENEFICIALLY'

with requests.Session() as session:
    for url in URLS:
        try:
            (r := session.get(url, headers=HEADERS)).raise_for_status()
            print(url)
            soup = BS(r.text, 'lxml')
            np = False
            for p in soup.find_all('p'):
                if np:
                    if (t := p.text.strip()):
                        if (m := re.search(r'(\d+)', t.replace(',', ''))):
                            print(f'\t{m[1]}')
                            np = False
                else:
                    if AGG in p.text.upper():
                        np = True
        except Exception as e:
            print(f'Error while processing {url} -> {e}')
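The same flag-based loop can be tried offline before hitting the live pages. The sketch below wraps it in a function and runs it on a fragment adapted from example 1 in the question. The names aggregate_amounts and SAMPLE are hypothetical, and 'html.parser' is used here only so the snippet runs without lxml installed; it is a sketch under those assumptions, not a tested scraper for every filing layout.

```python
import re

from bs4 import BeautifulSoup

AGG = 'AGGREGATE AMOUNT BENEFICIALLY'


def aggregate_amounts(html):
    """Return the first integer found in a <p> after each label <p>."""
    soup = BeautifulSoup(html, 'html.parser')
    amounts = []
    waiting = False  # True once the label has been seen and no number yet
    for p in soup.find_all('p'):
        text = p.get_text().strip()
        if waiting:
            # Empty paragraphs simply fail to match and are skipped
            m = re.search(r'\d+', text.replace(',', ''))
            if m:
                amounts.append(int(m.group()))
                waiting = False
        elif AGG in ' '.join(text.upper().split()):
            waiting = True
    return amounts


# Fragment adapted from example 1 in the question (styles removed)
SAMPLE = '''
<td colspan="5">
<p>&nbsp;&nbsp;Aggregate Amount Beneficially Owned by Each Reporting Person:</p> <p>&nbsp;</p>
<p>&nbsp;&nbsp;10,927,100*</p></td>
'''

print(aggregate_amounts(SAMPLE))  # [10927100]
```

Because the loop takes the first run of digits in the first non-empty paragraph after the label, a paragraph such as "Row (11)" appearing before the amount would be picked up instead, so the result is worth spot-checking against a few filings.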

CodePudding user response：

Here is a version with CSS selectors; I'll let you handle the formatting of the numbers. You will have to update the selector if you encounter different casings, but the syntax is really simple.

import requests
from bs4 import BeautifulSoup

links=[
    'https://www.sec.gov/Archives//edgar/data/29669/000119312521353247/d228949dsc13da.htm',
    'https://www.sec.gov/Archives//edgar/data/1482018/000119312521353071/d258656dsc13d.htm',
    'https://www.sec.gov/Archives//edgar/data/1590854/000119312521353064/d210475dsc13d.htm',
    'https://www.sec.gov/Archives//edgar/data/1021944/000119312521353050/d243715dsc13g.htm',
    'https://www.sec.gov/Archives//edgar/data/1609492/000119312521353003/d251372dsc13da.htm']

ua={"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_6 like Mac OS X) AppleWebKit/535.2 (KHTML, like Gecko) FxiOS/9.5o4548.0 Mobile/38G031 Safari/535.2"}
with requests.session() as s:
    s.headers.update(ua)
    for link in links:
        r=s.get(link)
        print(r.status_code, r.reason, link)
        soup=BeautifulSoup(r.text, 'lxml')
        ps=soup.select('''td[colspan="5"]:has(p:-soup-contains(
                "Aggregate Amount Beneficially",
                "AGGREGATE AMOUNT BENEFICIALLY"     
            )) p:last-of-type''')
        for p in ps:
            print(p.text)

CodePudding user response：

This answer is close to diggusbickus's approach: it uses CSS selectors with requests and BeautifulSoup only, so you do not need to import the re library.

It follows the same process of searching for the pattern; by using a Python f-string we can also apply .upper() to the pattern. My selection is focused on the <p> that contains the pattern and its last sibling <p>:

soup.select(f'''p:-soup-contains("{pattern}","{pattern.upper()}") ~ p:last-of-type''')

To extract only the digits and get an int, I use the built-in filter() function:

int("".join(filter(str.isdigit, p.text.split("(")[0])))

Note: to take care of digits in parentheses, we have to split the string before filtering.
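The effect of the split can be seen in a small standalone sketch. The helper name to_int is hypothetical, and the sample strings are taken from the question and from the output further down.

```python
def to_int(text):
    """Keep only the digits before the first '(' and convert them to int."""
    return int("".join(filter(str.isdigit, text.split("(")[0])))


print(to_int("1,100,037(2)"))  # 1100037
print(to_int("10,927,100*"))   # 10927100

# Without the split, the footnote digit is glued onto the number:
print(int("".join(filter(str.isdigit, "1,100,037(2)"))))  # 11000372
```

Note that int() raises ValueError when no digits are left, which is why the full example checks for at least one digit before converting.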

Example

import requests
from bs4 import BeautifulSoup

links=[
    'https://www.sec.gov/Archives//edgar/data/29669/000119312521353247/d228949dsc13da.htm',
    'https://www.sec.gov/Archives//edgar/data/1482018/000119312521353071/d258656dsc13d.htm',
    'https://www.sec.gov/Archives//edgar/data/1590854/000119312521353064/d210475dsc13d.htm',
    'https://www.sec.gov/Archives//edgar/data/1021944/000119312521353050/d243715dsc13g.htm',
    'https://www.sec.gov/Archives//edgar/data/1609492/000119312521353003/d251372dsc13da.htm'
]

headers={"User-Agent": "Mozilla/5.0"}
pattern = 'Aggregate Amount Beneficially'

for link in links:
    r=requests.get(link,headers=headers)
    soup=BeautifulSoup(r.text, 'lxml')
    for p in soup.select(f'''p:-soup-contains("{pattern}","{pattern.upper()}") ~ p:last-of-type'''):
        text = p.text.strip().split("(")[0]
        if any(char.isdigit() for char in text):
            print(f'{link}\n{p.text}\n{int("".join(filter(str.isdigit, text)))}')

Output

The output should show the conversion from string to int; the URL is printed only to check the results: ...

https://www.sec.gov/Archives//edgar/data/1609492/000119312521353003/d251372dsc13da.htm
1,100,037(2)
1100037

...