I have been trying to automate this page with Selenium to get the email address. I used the XPath //span[@]/a/@href,
which is perfectly fine, but Selenium doesn't extract the value from it.
I also tried a regex, re.findall(r'mailto:(.*?)\?sub', str(driver.page_source)), but that didn't work either.
Can anyone tell me what the issue is here? Why isn't it getting the email, and how can I extract it?
from selenium import webdriver
from scrapy.selector import Selector
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import re
driver = webdriver.Chrome()
driver.get('https://www.ukparks.com/park/haighfield-park/')
WebDriverWait(driver, 7).until(
    EC.presence_of_element_located((By.XPATH, '//span[@]'))
)
response = Selector(text=driver.page_source)
email = response.xpath('//span[@]/a/@href').get()
email_re = re.findall(r'mailto:(.*?)\?sub', str(driver.page_source))
print(email)
print(email_re)
CodePudding user response:
The page seems to populate the href only after a click event on the a tag, so click the link first and then read the attribute.
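wait = WebDriverWait(driver, 10)
driver.get('https://www.ukparks.com/park/haighfield-park/')
# Click the link first; the href appears to be filled in only after the click.
wait.until(EC.element_to_be_clickable((By.XPATH, '//span[@]/a'))).click()
# Locate the link again and read the href that is now populated.
link = wait.until(EC.element_to_be_clickable((By.XPATH, '//span[@]/a'))).get_attribute("href")
print(link)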
Output:
mailto:[email protected]?subject=Enquiry from UKParks.com
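Once the href is populated, the address still has to be separated from the mailto: prefix and the subject query. The following is a minimal sketch of that step, assuming an href shaped like the output above. The address info@example.com is a placeholder and address_from_mailto is a hypothetical helper name, neither taken from the site. It also shows that the regex from the question should match once the href is actually present.

```python
import re
from urllib.parse import urlsplit, unquote

# Hypothetical href, shaped like the output above; the address is a placeholder.
href = "mailto:info@example.com?subject=Enquiry from UKParks.com"


def address_from_mailto(href):
    """Return the address part of a mailto: link, or None for other schemes."""
    parts = urlsplit(href)
    if parts.scheme != "mailto":
        return None
    # The address is the path component; the subject lives in the query.
    return unquote(parts.path)


print(address_from_mailto(href))

# The regex from the question gives the same result on a populated href.
print(re.findall(r"mailto:(.*?)\?sub", href))
```

Parsing the link with urlsplit avoids depending on the query starting with "sub", which the regex does.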
CodePudding user response:
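You can also fetch the email from that site using requests alone. The script in the page pushes the address piece by piece into an array (ehArr) in reverse order, so collect the pieces with a regex, reverse them, and join them: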
import re
import requests
from bs4 import BeautifulSoup
link = 'https://www.ukparks.com/park/haighfield-park/'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
soup = BeautifulSoup(res.text, "lxml")
item = soup.select_one(".detail-box").get_text(strip=True)
email_raw = re.findall(r"ehArr\.push\('(.*?)'\);", item)
email = ''.join(email_raw[::-1])
print(email)
Output:
[email protected]
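The reverse-and-join step can be tried offline, without hitting the site. This is only a sketch: sample_script is a made-up stand-in for the page's inline script (the real pieces and how they are split are not shown in this thread), and decode_email is a hypothetical helper name.

```python
import re

# Hypothetical stand-in for the inline script text; the real page's pieces
# are not shown here, so these values are placeholders.
sample_script = (
    "var ehArr = [];"
    "ehArr.push('com');"
    "ehArr.push('example.');"
    "ehArr.push('info@');"
)


def decode_email(script_text):
    """Collect every ehArr.push('...') argument, reverse the order, and join."""
    pieces = re.findall(r"ehArr\.push\('(.*?)'\);", script_text)
    return "".join(pieces[::-1])


print(decode_email(sample_script))  # info@example.com
```

If the site changes the array name or the order of the pieces, the regex and the [::-1] reversal would need adjusting.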