I want to scrape this website: https://cage.dla.mil/Home/UsageAgree using Beautiful Soup. What I'm doing:
import requests
from bs4 import BeautifulSoup
url = "https://cage.dla.mil/Home/UsageAgree"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup)
which returns the HTML of a cookie/usage-agreement page. What I want is to get past this page and scrape the content of the actual site, as if the agreement had been accepted.
I followed this post: Scraping a webpage using Python (beautiful soup) that requires "I agree to cookies" button being clicked?
and did:
import requests
from bs4 import BeautifulSoup
url = 'https://cage.dla.mil/'
s = requests.Session()
s.cookies.update({'agree': 'True'})
r = s.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup)
but I'm still getting the agreement page.
It seems that one of the cookies is assigned a new, unique value on every request, and I'm not sure how to deal with that.
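For reference, here's a rough way to watch that value change between requests (a quick sketch; it assumes the token sits in a hidden input inside the form, which is what the returned HTML suggests):

import requests
from bs4 import BeautifulSoup

url = "https://cage.dla.mil/Home/UsageAgree"
for _ in range(2):
    html = requests.get(url).text
    hidden = BeautifulSoup(html, "html.parser").select("form input[type=hidden]")
    # Each iteration prints a different value for the hidden token field.
    print({i.get("name"): i.get("value") for i in hidden})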
CodePudding user response:
Well, this should work. The unique value you're seeing is the anti-forgery token (__RequestVerificationToken) that the agreement form embeds as a hidden field, so you can't just set a cookie yourself. Instead, fetch the agreement page, read that token out of the form, and POST it back to the same URL inside one requests.Session(); the session then carries whatever cookies the server sets, and subsequent requests return the real pages.
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0"
}

with requests.Session() as s:
    # Load the usage-agreement page and read the anti-forgery token
    # from the form's input field.
    token = (
        BeautifulSoup(
            s.get(
                "https://cage.dla.mil/Home/UsageAgree",
                headers=headers,
            ).text,
            "lxml",
        ).select_one("form input")["value"]
    )
    # Post the agreement form back with that token; the session keeps
    # the cookies the server sets in response.
    payload = {
        "__RequestVerificationToken": token,
        "returningURL": "",
    }
    _ = s.post(
        "https://cage.dla.mil/Home/UsageAgree",
        data=payload,
        headers=headers,
    )
    # With the agreement accepted, the home page now returns the real
    # content, so pull the news headlines from it.
    soup = (
        BeautifulSoup(
            s.get("https://cage.dla.mil/", headers=headers).text,
            "lxml",
        ).select("#briefnewslist > div > p > em")
    )
    print("\n".join(p.getText(strip=True) for p in soup))
Output:
Scheduled Maintenance
SAM Validation: Unable To Find A Matching Entity When Asked To Enter Or Validate My Entity Information
SAM Validation: Continue A Registration Update Or Renewal If Validation Fails
SAM.gov Registration for Financial Assistance
Financial Assistance Update
CAGE Expiration Date
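The key point is that the __RequestVerificationToken can't be hard-coded or reused: it has to be read from the agreement form each time and posted back within the same Session. Once that POST goes through, the session carries the acceptance cookies, so further s.get() calls made with the same session should return the real page content instead of the agreement page.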