I am trying to build a loop that goes through a list of web links and extracts the integer in the p tag that follows the p tag containing the words "AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON". The target pages follow similar patterns but are not identical. The AGGREGATE AMOUNT sentence is not formatted consistently (sometimes all caps, sometimes not, and the spacing may differ). I have included 3 examples below of the section I am trying to extract the integer value from.
Below are the steps I am trying to achieve:
1-Iterate through links. Some links:
https://www.sec.gov/Archives//edgar/data/29669/000119312521353247/d228949dsc13da.htm
https://www.sec.gov/Archives//edgar/data/1482018/000119312521353071/d258656dsc13d.htm
https://www.sec.gov/Archives//edgar/data/1590854/000119312521353064/d210475dsc13d.htm
https://www.sec.gov/Archives//edgar/data/1021944/000119312521353050/d243715dsc13g.htm
https://www.sec.gov/Archives//edgar/data/1609492/000119312521353003/d251372dsc13da.htm
...
2-Find the p tag that contains the partial phrase "AGGREGATE AMOUNT BENEFICIALLY", which can be lowercase, title case, or all caps
<p style="margin-top:0pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman"> Aggregate Amount Beneficially Owned by Each Reporting Person:</p>
3-Once the p tag is found, keep looking through the following p tags until we reach the first one whose string (if it exists), stripped of its commas, is a non-empty integer. This can be the p tag that immediately follows, or the one after it, as in the code below.
<p style="font-size:12pt; margin-top:0pt; margin-bottom:0pt"> </p><p style="margin-top:0pt; margin-bottom:1pt; font-size:10pt; font-family:Times New Roman"> 10,927,100*</p>
10,927,100 is the number I need. It has to be stripped of its commas (I know we can use the code below), but the string also has to be stripped of any other characters that are not digits...
int(shares.replace(',', ''))
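One way to do that cleanup could be sketched as follows. It assumes the count always comes first in the paragraph and that any footnote marker such as * or (1) trails it; the helper name parse_share_count is hypothetical.

```python
import re

# Hypothetical helper: pull the leading share count out of a paragraph's text.
# Assumes the number comes first and footnote markers such as "*" or "(1)" follow it.
_LEADING_NUMBER = re.compile(r'\s*(\d[\d,]*)')

def parse_share_count(text):
    """Return the leading comma-grouped number as an int, or None if there is none."""
    match = _LEADING_NUMBER.match(text)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))

print(parse_share_count(' 10,927,100*'))   # 10927100
print(parse_share_count('28,911,268(1)'))  # 28911268
print(parse_share_count(' '))              # None
```

Taking only the leading run of digits and commas avoids merging a footnote number into the share count, which stripping every non-digit character from the whole string would do.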
example1:
<table border="0" cellpadding="0" cellspacing="0" style="BORDER-COLLAPSE:COLLAPSE; font-family:Times New Roman; font-size:10pt" width="98%">
<tr>
<td width="3%"></td>
<td valign="bottom" width="1%"></td>
<td width="6%"></td>
<td valign="bottom" width="1%"></td>
<td></td>
<td valign="bottom" width="1%"></td>
<td width="88%"></td></tr>
<tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:10pt">
<td style="BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-left:8pt" valign="top">11. </td>
<td style=" BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000" valign="bottom"> </td>
<td colspan="5" style="BORDER-TOP:1px solid #000000; BORDER-RIGHT:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-right:2pt" valign="top">
<p style="margin-top:0pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman"> Aggregate Amount Beneficially Owned by Each Reporting Person:</p> <p style="font-size:12pt; margin-top:0pt; margin-bottom:0pt"> </p>
<p style="margin-top:0pt; margin-bottom:1pt; font-size:10pt; font-family:Times New Roman"> 10,927,100*</p></td></tr>
<tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:10pt">
<td style="BORDER-LEFT:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-left:8pt" valign="top">12.</td>
<td style=" BORDER-LEFT:1px solid #000000; BORDER-BOTTOM:1px solid #000000" valign="bottom"> </td>
<td colspan="5" style="BORDER-RIGHT:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-right:2pt" valign="top"> <p style="margin-top:0pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman"> Check if the Aggregate
Amount in Row (11) Excludes Certain Shares (See Instructions):</p> <p style="font-size:12pt;margin-top:0pt;margin-bottom:0pt"> </p>
example2:
<p style="font-size:12pt; margin-top:0pt; margin-bottom:0pt"> </p> <p style="margin-top:0pt; margin-bottom:1pt; font-size:12pt; font-family:Times New Roman">28,911,268(1)</p></td></tr>
<tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:12pt">
<td style="BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-left:8pt" valign="top"><font style="font-size:10pt">11</font></td>
<td style=" BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000" valign="bottom"> </td>
<td colspan="5" style="BORDER-TOP:1px solid #000000; BORDER-RIGHT:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-right:2pt" valign="top">
<p style="margin-top:0pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman">AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON</p> <p style="font-size:12pt; margin-top:0pt; margin-bottom:0pt"> </p>
<p style="margin-top:0pt; margin-bottom:1pt; font-size:12pt; font-family:Times New Roman">28,911,268</p></td></tr>
<tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:12pt">
example3:
<table border="0" cellpadding="0" cellspacing="0" style="BORDER-COLLAPSE:COLLAPSE; font-family:Times New Roman; font-size:10pt" width="100%">
<tr>
<td width="3%"></td>
<td valign="bottom" width="1%"></td>
<td width="6%"></td>
<td valign="bottom" width="1%"></td>
<td></td>
<td valign="bottom" width="1%"></td>
<td width="88%"></td></tr>
<tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:10pt">
<td style="BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-left:8pt" valign="top">11 </td>
<td style=" BORDER-LEFT:1px solid #000000; BORDER-TOP:1px solid #000000; BORDER-BOTTOM:1px solid #000000" valign="bottom"> </td>
<td colspan="5" style="BORDER-TOP:1px solid #000000; BORDER-RIGHT:1px solid #000000; BORDER-BOTTOM:1px solid #000000; padding-right:2pt" valign="top">
<p style="margin-top:0pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman"> AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON</p> <p style="font-size:12pt; margin-top:0pt; margin-bottom:0pt"> </p><p style="margin-top:0pt; margin-bottom:1pt; font-size:10pt; font-family:Times New Roman"> 23,866,091</p></td></tr><tr style="page-break-inside:avoid ; font-family:Times New Roman; font-size:10pt">
For the time being, I am collecting every string form of the sentence "AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON" that I come across, instead of searching for a partial string that matches them all.
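A partial, case-insensitive match could replace the list of exact strings. The sketch below is only an illustration: the name LABEL and the sample strings are assumptions, with the samples modelled on the examples above.

```python
import re

# Tolerates any capitalization and any run of whitespace (spaces, newlines,
# non-breaking spaces) between the words. A leading "9." or "11." does not
# matter because search() looks anywhere in the text.
LABEL = re.compile(r'aggregate\s+amount\s+beneficially', re.IGNORECASE)

samples = [
    ' Aggregate Amount Beneficially Owned by Each Reporting Person:',
    'AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON',
    '9. AGGREGATE  AMOUNT\nBENEFICIALLY OWNED BY EACH REPORTING PERSON',
    'Check if the Aggregate\nAmount in Row (11) Excludes Certain Shares',
]
matches = [bool(LABEL.search(text)) for text in samples]
print(matches)  # [True, True, True, False]
```

With BeautifulSoup, the same pattern can be tested against each paragraph's p.get_text(), so one check covers every variant instead of one try block per spelling.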
My current code so far:
import time
import requests

def doGet(url):
    s = requests.Session()
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
        'Sec-CH-UA': 'Examplary Browser',
        'Sec-CH-UA-Mobile': '?0',
        'Sec-CH-UA-Platform': "Windows",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "authority": "www.sec.gov",
        "method": "GET",
        "path": "/Archives/edgar/data/59478/000120919121046268/0001209191-21-046268-index.htm",
        "scheme": "https",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9,ar-LB;q=0.8,ar;q=0.7",
        "cache-control": "max-age=0"}
    r = None
    # SOCKS proxies need the optional dependency: pip install requests[socks]
    proxies = {
        'http': 'socks5://127.0.0.1:9050',
        'https': 'socks5://127.0.0.1:9050'
    }
    try:
        r = s.get(url, headers=headers)
        if r.status_code == 403:
            i = 1
            # Retry through the proxy, up to 10 times, while the server answers 403
            while r.status_code == 403 and i <= 10:
                print("Status code: " + str(r.status_code) + " for : " + url)
                time.sleep(1)
                r = s.get(url, headers=headers, proxies=proxies)
                i += 1
        # print("Status code: " + str(r.status_code) + " for : " + url)
        # r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        # Only reachable if raise_for_status() above is re-enabled
        print("Http Error: " + str(errh) + " for url: " + url)
        if r is not None and r.status_code == 403:
            i = 1
            while r.status_code == 403 and i <= 10:
                time.sleep(i)
                r = s.get(url, headers=headers)
                i += 1
    except requests.exceptions.ConnectionError as errc:
        print("Error Connecting: " + str(errc) + " for : " + url)
    except requests.exceptions.Timeout as errt:
        print("Timeout Error: " + str(errt) + " for : " + url)
    except requests.exceptions.RequestException as err:
        print("OOps: Something Else" + str(err) + " for : " + url)
    return r
from bs4 import BeautifulSoup

# link_list is the list of filing URLs from step 1
for link in link_list:
    res = doGet(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Finding a pattern(certain text)
    try:
        pattern = '9. AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON'
        # Anchor tag
        text1 = soup.find('p', text=pattern)
        # print(text1)
        shares = text1.find_next_sibling('p').string
        shares
    except Exception as e:
        print('test1' + str(e))
        pass
    try:
        pattern = " AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON"
        # Anchor tag
        text1 = soup.find('p', text=pattern)
        # print(text1)
        shares = text1.find_next_sibling('p').find_next_sibling('p').string
        print(shares)
    except Exception as e:
        print('test2' + str(e))
        pass
    try:
        pattern = "AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON"
        # Anchor tag
        text1 = soup.find('p', text=pattern)
        # print(text1)
        shares = text1.find_next_sibling('p').find_next_sibling('p').string
        print(shares)
    except Exception as e:
        print('test3' + str(e))
        pass
    try:
        pattern = " Aggregate Amount Beneficially Owned by Each Reporting Person:"
        # Anchor tag
        text1 = soup.find('p', text=pattern)
        # print(text1)
        shares = text1.find_next_sibling('p').find_next_sibling('p').string
        print(shares)
    except Exception as e:
        print('test4' + str(e))
        pass
    try:
        pattern = '9. AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON'
        # Anchor tag
        text1 = soup.find('p', text=pattern)
        # print(text1)
        shares = text1.find_next_sibling('p').string
        shares
    except Exception as e:
        print('test1' + str(e))
        pass
CodePudding user response:
You should be able to adapt this. The pages you're visiting are very plain HTML (no Javascript) so it's very simple and efficient just to process the page in a simple loop.
What happens here is that we look for a p tag with text matching "AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON" in a case-insensitive manner. If we find that, we set a flag. Then on subsequent iterations of the p tags found by BeautifulSoup, we look for text that is a) not empty and b) contains a sequence of digits (ignoring commas).
And so on...
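The flag logic described above can be tried without any network access. The sketch below is an illustration only: it works on a plain list of paragraph texts, the function name counts_after_label is hypothetical, and the whitespace normalization is an addition that the description above does not mention.

```python
import re

LABEL = 'AGGREGATE AMOUNT BENEFICIALLY'

def counts_after_label(paragraphs):
    """Return the first number found after each label paragraph."""
    results = []
    waiting = False  # True once a label has been seen and no number found yet
    for text in paragraphs:
        # Collapse whitespace runs so line breaks inside the label do not matter.
        normalized = ' '.join(text.split())
        if waiting:
            match = re.search(r'\d+', normalized.replace(',', ''))
            if match:
                results.append(int(match.group()))
                waiting = False
        elif LABEL in normalized.upper():
            waiting = True
    return results

paragraphs = [
    ' Aggregate Amount Beneficially Owned by Each Reporting Person:',
    '\xa0',
    ' 10,927,100*',
    'Check if the Aggregate Amount in Row (11) Excludes Certain Shares',
]
print(counts_after_label(paragraphs))  # [10927100]
```

On a real page the list could be built with [p.get_text() for p in soup.find_all('p')]. Note that while the flag is set, the first paragraph containing any digits is taken, so a filing that leaves the amount blank would pick up a number from a later row.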
import requests
from bs4 import BeautifulSoup as BS
import re

URLS = ['https://www.sec.gov/Archives//edgar/data/29669/000119312521353247/d228949dsc13da.htm',
        'https://www.sec.gov/Archives//edgar/data/1482018/000119312521353071/d258656dsc13d.htm',
        'https://www.sec.gov/Archives//edgar/data/1590854/000119312521353064/d210475dsc13d.htm',
        'https://www.sec.gov/Archives//edgar/data/1021944/000119312521353050/d243715dsc13g.htm',
        'https://www.sec.gov/Archives//edgar/data/1609492/000119312521353003/d251372dsc13da.htm']
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    "Accept-Language": "en-US,en;q=0.5"
}
AGG = 'AGGREGATE AMOUNT BENEFICIALLY'

# The := operator needs Python 3.8 or later
with requests.Session() as session:
    for url in URLS:
        try:
            (r := session.get(url, headers=HEADERS)).raise_for_status()
            print(url)
            soup = BS(r.text, 'lxml')
            np = False  # True once the label has been seen
            for p in soup.find_all('p'):
                if np:
                    if (t := p.text.strip()):
                        # First run of digits once the commas are removed
                        if (m := re.search(r'(\d+)', t.replace(',', ''))):
                            print(f'\t{m[1]}')
                            np = False
                else:
                    if AGG in p.text.upper():
                        np = True
        except Exception as e:
            print(f'Error while processing {url} -> {e}')
CodePudding user response:
A version with CSS selectors; I'll let you handle the formatting too. You will have to update the selector if you encounter different casings, but the syntax is really simple.
import requests
from bs4 import BeautifulSoup

links = [
    'https://www.sec.gov/Archives//edgar/data/29669/000119312521353247/d228949dsc13da.htm',
    'https://www.sec.gov/Archives//edgar/data/1482018/000119312521353071/d258656dsc13d.htm',
    'https://www.sec.gov/Archives//edgar/data/1590854/000119312521353064/d210475dsc13d.htm',
    'https://www.sec.gov/Archives//edgar/data/1021944/000119312521353050/d243715dsc13g.htm',
    'https://www.sec.gov/Archives//edgar/data/1609492/000119312521353003/d251372dsc13da.htm']
ua = {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_6 like Mac OS X) AppleWebKit/535.2 (KHTML, like Gecko) FxiOS/9.5o4548.0 Mobile/38G031 Safari/535.2"}

with requests.session() as s:
    s.headers.update(ua)
    for link in links:
        r = s.get(link)
        print(r.status_code, r.reason, link)
        soup = BeautifulSoup(r.text, 'lxml')
        ps = soup.select('''td[colspan="5"]:has(p:-soup-contains(
            "Aggregate Amount Beneficially",
            "AGGREGATE AMOUNT BENEFICIALLY"
        )) p:last-of-type''')
        for p in ps:
            print(p.text)
CodePudding user response:
Close to diggusbickus's approach, this answer also uses CSS selectors, with requests and BeautifulSoup only, so you do not need to import the re library.
It follows the same process of searching for the pattern. Using a Python f-string, we can also apply .upper() to the pattern to cover the all-caps variant. My selection focuses on the <p> that contains the pattern and on its last sibling <p>:
soup.select(f'''p:-soup-contains("{pattern}","{pattern.upper()}") ~ p:last-of-type''')
To extract only the digits and get an int(), I use the built-in filter() function:
int("".join(filter(str.isdigit, p.text.split("(")[0])))
Note: to deal with digits in (), we have to split the string before filtering.
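A small illustration of why the split matters, using the sample value 28,911,268(1) from the question. The variable names are only for this sketch.

```python
raw = '28,911,268(1)'

# Filtering the whole string also keeps the footnote digit.
naive = int(''.join(filter(str.isdigit, raw)))

# Splitting at "(" first drops the footnote marker before filtering.
cleaned = int(''.join(filter(str.isdigit, raw.split('(')[0])))

print(naive)    # 289112681
print(cleaned)  # 28911268
```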
Example
import requests
from bs4 import BeautifulSoup

links = [
    'https://www.sec.gov/Archives//edgar/data/29669/000119312521353247/d228949dsc13da.htm',
    'https://www.sec.gov/Archives//edgar/data/1482018/000119312521353071/d258656dsc13d.htm',
    'https://www.sec.gov/Archives//edgar/data/1590854/000119312521353064/d210475dsc13d.htm',
    'https://www.sec.gov/Archives//edgar/data/1021944/000119312521353050/d243715dsc13g.htm',
    'https://www.sec.gov/Archives//edgar/data/1609492/000119312521353003/d251372dsc13da.htm'
]
headers = {"User-Agent": "Mozilla/5.0"}
pattern = 'Aggregate Amount Beneficially'

for link in links:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    for p in soup.select(f'''p:-soup-contains("{pattern}","{pattern.upper()}") ~ p:last-of-type'''):
        # Drop any footnote marker such as "(2)" before filtering the digits
        text = p.text.strip().split("(")[0]
        if any(char.isdigit() for char in text):
            print(f'{link}\n{p.text}\n{int("".join(filter(str.isdigit, text)))}')
Output
The output should show the conversion from string to int; the URL is printed only to check the results:
...
https://www.sec.gov/Archives//edgar/data/1609492/000119312521353003/d251372dsc13da.htm
1,100,037(2)
1100037
...