I'm trying to scrape an email address from this webpage using the requests module rather than Selenium. The address is obfuscated and does not appear in the page source; a JavaScript snippet generates it when the page renders. How can I use the following portion of the page to recover the email address shown on that webpage?
document.write("\u003cn uers=\"znvygb:[email protected]\"\[email protected]\u003c/n\u003e".replace(/[a-zA-Z]/g, function(c){return String.fromCharCode((c<="Z"?90:122)>=(c=c.charCodeAt(0)+13)?c:c-26);}));
I've tried so far with:
import requests
from bs4 import BeautifulSoup
link = 'https://www.californiatoplawyers.com/lawyer/311805/tobyn-yael-aaron'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,"html.parser")
email = soup.select_one("dt:-soup-contains('Email') + dd")
print(email)
Expected output:
[email protected]
CodePudding user response:
For tasks like this I recommend the js2py module:
import js2py
import requests
from bs4 import BeautifulSoup
link = "https://www.californiatoplawyers.com/lawyer/311805/tobyn-yael-aaron"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
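# <dd> is a sibling of <dt>, so use the adjacent-sibling combinator "+"
email = soup.select_one("dt:-soup-contains('Email') + dd")
# Drop document.write so the script evaluates to the decoded HTML string
js_code = email.script.contents[0].replace("document.write", "")
# Run the remaining JS expression and pull the text out of the resulting <a> tag
email = BeautifulSoup(js2py.eval_js(js_code), "html.parser").text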
print(email)
Prints:
[email protected]
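As an aside, the replace callback in that script looks like a plain ROT13 shift (each letter moved 13 places, wrapping around past Z or z), so it may be possible to skip the JavaScript engine entirely. Below is a minimal sketch under that assumption. The helper name decode_document_write and the sample address jane@example.com are made up for illustration and are not taken from the page; the sketch also assumes the string literal only uses escapes that JSON understands, such as \uXXXX and \".

```python
import codecs
import json
import re


def decode_document_write(js_code):
    """Return the ROT13-decoded HTML from a document.write("...") call.

    Hypothetical helper: assumes the obfuscation is plain ROT13 and that the
    first argument is a double-quoted string literal.
    """
    # Grab the double-quoted literal passed to document.write, escapes included
    match = re.search(r'document\.write\(\s*("(?:[^"\\]|\\.)*")', js_code)
    if match is None:
        raise ValueError("no document.write string literal found")
    # Resolve \uXXXX and \" escapes first, as the JS parser would
    obfuscated_html = json.loads(match.group(1))
    # ROT13 touches letters only, which mirrors the /[a-zA-Z]/g replace
    return codecs.decode(obfuscated_html, "rot_13")


# Made-up sample in the same shape as the script on the page
sample_js = r'document.write("\u003cn uers=\"znvygb:wnar@rknzcyr.pbz\"\u003ewnar@rknzcyr.pbz\u003c/n\u003e".replace(/[a-zA-Z]/g, function(c){return String.fromCharCode((c<="Z"?90:122)>=(c=c.charCodeAt(0)+13)?c:c-26);}));'

decoded_html = decode_document_write(sample_js)
# Strip the tags to leave just the link text
sample_email = re.sub(r"<[^>]+>", "", decoded_html)
print(sample_email)
```

With the real page, you would pass email.script.contents[0] from the code above instead of sample_js. If the site ever changes the obfuscation scheme, the js2py approach should be more robust, since it runs whatever the script actually contains.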