My organization requires two-factor authentication to access an internal website that I need to scrape. Every time I open a browser, it asks for authentication. The authentication cookie is stored in c://users//.way//cookie.bat. I want to use this cookie file to scrape the internal website. Can someone help me with this?
Sample program:
from bs4 import BeautifulSoup
import requests
header={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
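cookie_file='c://users//.way//cookie.bat' # the cookies should be read from this file and passed to requests
cookies={} # placeholder: this should hold the contents of the cookie file
source=requests.get('https://www.internalwebsite.com',headers=header,cookies=cookies) # the keyword argument is "cookies"
soup=BeautifulSoup(source.text,'lxml') # BeautifulSoup needs the response body, not the Response object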
### general scraping
I tried reading the cookie file but was unable to do so. Please help me read the cookie file and pass it to requests so that I can access the internal website and parse it with BeautifulSoup.
CodePudding user response:
BeautifulSoup does not handle cookies; that is the job of requests. Automatically parsing cookies from a file and adding them to your requests session is a bit complicated, but the general idea would be:
- read the contents of the file with open()
- parse the contents into a Python dict (how depends on the format of your cookie file)
- create a requests session with your cookies
- use the session to get the website
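It might be easier to just hardcode the authentication cookies in your code (for example session.cookies.update({'name': 'value'}), with your own cookie name and value) if you don't need to update them too often. Note that session.auth = ('user', 'pass') sets HTTP Basic authentication credentials, not cookies.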
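The steps above could be sketched roughly as follows. This is only a minimal sketch: it assumes the cookie file holds a single "name=value; name2=value2" string, which may not match your file, and the helper names parse_cookie_string, load_cookies and build_session are made up for illustration.

```python
from pathlib import Path


def parse_cookie_string(raw):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'} (assumed file format)."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed fragments
        name, value = part.split("=", 1)  # the value may itself contain '='
        cookies[name.strip()] = value.strip()
    return cookies


def load_cookies(path):
    """Read the cookie file and parse its contents into a dict."""
    return parse_cookie_string(Path(path).read_text().strip())


def build_session(cookies, user_agent=None):
    """Create a requests session that sends the given cookies on every request."""
    import requests  # third-party; imported here so the parsing helpers work without it

    session = requests.Session()
    session.cookies.update(cookies)
    if user_agent:
        session.headers["User-Agent"] = user_agent
    return session
```

Usage would then be something like session = build_session(load_cookies('c://users//.way//cookie.bat')), followed by session.get('https://www.internalwebsite.com') and passing the response's .text to BeautifulSoup. If the file turns out to be in another format (for example a Netscape cookies.txt or JSON), only the parsing step would need to change.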
CodePudding user response:
A cookie header is just a ";"-separated string of key=value pairs. You can parse it with SimpleCookie from Python's standard library.
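from http.cookies import SimpleCookie  # Python 3 (the module was named Cookie in Python 2)
import requests

# read the raw "key=value; key2=value2" string from the cookie file
with open('c://users//.way//cookie.bat') as f:
    cookie = SimpleCookie(f.read().strip())
cookies = {k: v.value for k, v in cookie.items()}  # items() replaces Python 2's iteritems()
requests.get('https://www.internalwebsite.com', cookies=cookies)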
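To check what SimpleCookie gives you before touching the network, a small offline sketch like the one below may help. It targets Python 3, where the class lives in http.cookies, and the cookie names and values in raw are invented sample data, not the real contents of your file.

```python
from http.cookies import SimpleCookie

# Made-up sample in the "key=value; key=value" format described above.
raw = 'sessionid=abc123; csrftoken=xyz789; pref="dark mode"'

cookie = SimpleCookie()
cookie.load(raw)  # parses the string into Morsel objects keyed by cookie name

# Each Morsel exposes the decoded value via .value; surrounding quotes are removed.
cookies = {name: morsel.value for name, morsel in cookie.items()}

for name, value in cookies.items():
    print(name, "=", value)
```

Compared with splitting on ";" by hand, SimpleCookie also takes care of quoted values, which is why the quotes around "dark mode" do not end up in the resulting dict.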