Web scraping websites with old unsupported Internet Explorer browser

I am trying to scrape the following website (https://iltacon2022.expofp.com/) and I keep getting the following message in the result (full output printed below). I'm not sure what the issue is, and I was wondering if someone could help me.

 if (window.navigator.userAgent.indexOf("Trident/") !== -1) {
                alert("Your are using old unsupported Internet Explorer browser.\nPlease upgrade to view this page properly.");
            }

I've tried using Selenium and the requests module, but I get the same result either way.

Code trials:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
import time
import requests

options = Options()
options.headless = False
driver = webdriver.Firefox(options=options)

url = "https://iltacon2022.expofp.com/"

driver.get(url)

time.sleep(6)

soup = bs(driver.page_source, 'lxml')

driver.quit()

print(soup)

Output:

<html lang="en"><head>
<meta charset="utf-8"/>
<link href="https://iltacon2022.expofp.com/packages/master/favicon.png" rel="shortcut icon"/>
<meta content="user-scalable=no, initial-scale=1.0, maximum-scale=1.0, width=device-width" name="viewport"/>
<!-- <meta name="theme-color" content="#000000" /> -->
<title>ILTACON2022 – Gaylord National Resort and Convention Center | August 22–25, 2022 | Monday – Thursday – Expo Floor Plan by ExpoFP</title>
<script>
            if (window.navigator.userAgent.indexOf("Trident/") !== -1) {
                alert("Your are using old unsupported Internet Explorer browser.\nPlease upgrade to view this page properly.");
            }
        </script>
<style>
            html,
            body {
                touch-action: none;
                margin: 0;
                padding: 0;
                height: 100%;
                width: 100%;
                background: #ebebeb;
                position: fixed;
                overflow: hidden;
            }
            @media (max-width: 820px) and (min-width: 500px) {
                html {
                    font-size: 13px;
                }
            }
        </style>
<style>
            .lds-grid {
                top: 42vh;
                margin: 0 auto;
                display: block;
                position: relative;
                width: 64px;
                height: 64px;
            }

            .lds-grid div {
                position: absolute;
                width: 13px;
                height: 13px;
                background: #aaa;
                border-radius: 50%;
                /* border: solid 1px #fff; */
                animation: lds-grid 1.2s linear infinite;
            }

            .lds-grid div:nth-child(1) {
                top: 6px;
                left: 6px;
                animation-delay: 0s;
            }

            .lds-grid div:nth-child(2) {
                top: 6px;
                left: 26px;
                animation-delay: -0.4s;
            }

            .lds-grid div:nth-child(3) {
                top: 6px;
                left: 45px;
                animation-delay: -0.8s;
            }

            .lds-grid div:nth-child(4) {
                top: 26px;
                left: 6px;
                animation-delay: -0.4s;
            }

            .lds-grid div:nth-child(5) {
                top: 26px;
                left: 26px;
                animation-delay: -0.8s;
            }

            .lds-grid div:nth-child(6) {
                top: 26px;
                left: 45px;
                animation-delay: -1.2s;
            }

            .lds-grid div:nth-child(7) {
                top: 45px;
                left: 6px;
                animation-delay: -0.8s;
            }

            .lds-grid div:nth-child(8) {
                top: 45px;
                left: 26px;
                animation-delay: -1.2s;
            }

            .lds-grid div:nth-child(9) {
                top: 45px;
                left: 45px;
                animation-delay: -1.6s;
            }

            @keyframes lds-grid {
                0%,
                100% {
                    opacity: 1;
                }

                50% {
                    opacity: 0.5;
                }
            }
        </style>
<link as="script" href="https://iltacon2022.expofp.com/data/data.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/data/fp.svg.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/packages/master/floorplan.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/packages/master/vendors~floorplan.js" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/css/fontawesome-all.min.css" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/sanitize-css/sanitize.css" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/perfect-scrollbar/css/perfect-scrollbar.css" rel="preload"/>
<!-- Fonts are anonymous because those will be loaded with FontFace -->
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-regular-400.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-solid-900.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-light-300.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/fonts/oswald-v17-cyrillic_latin-500.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/fonts/oswald-v17-cyrillic_latin-300.woff2" rel="preload"/>
<script src="https://iltacon2022.expofp.com/data/data.js"></script><script src="https://iltacon2022.expofp.com/data/wf.data.js"></script><script src="https://iltacon2022.expofp.com/data/fp.svg.js"></script><script charset="utf-8" src="https://iltacon2022.expofp.com/packages/master/vendors~floorplan.js"></script><script charset="utf-8" src="https://iltacon2022.expofp.com/packages/master/floorplan.js"></script></head>
<body>
<noscript>You need to enable JavaScript to run this app.</noscript>
<div  data-event-id="iltacon2022"><div></div></div>
<script src="https://iltacon2022.expofp.com/packages/master/expofp.js"></script>
</body></html>

CodePudding user response：

At times the AUT (Application Under Test) uses jQuery to detect whether Internet Explorer is being used to access the application.

As per the discussion "Jquery fail to detect IE 11", Internet Explorer 10 was detected properly, but Internet Explorer 11 was not, because it sends a different user agent:

Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko

The proposed beta solution was:

if (!!navigator.userAgent.match(/Trident\/7\./))
  return "ie";

which apparently was not adopted. Instead, a modified check was implemented:

<script>
    if (window.navigator.userAgent.indexOf("Trident/") !== -1) {
        alert("Your are using old unsupported Internet Explorer browser.\nPlease upgrade to view this page properly.");
    }
</script>

This is the check you see inside the <script> tag. It means that if the user agent contains the string Trident/, you are using Internet Explorer, and the page shows an alert asking you to upgrade. In any other browser, including the Firefox instance driven by Selenium, the condition is false and nothing happens.

Conclusion

The script text is part of the page source in every browser, so seeing it in your output is not an error. The alert only fires if you are actually using Internet Explorer; otherwise you can safely ignore it, as it won't affect your tests.
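
To make the point concrete, the page's test can be mirrored in Python. This is only a sketch: the function name triggers_ie_alert and the Firefox user-agent string are illustrative assumptions, not something taken from the site.

```python
def triggers_ie_alert(user_agent: str) -> bool:
    """Mirror the page's JavaScript test: userAgent.indexOf("Trident/") !== -1."""
    return "Trident/" in user_agent


# A typical Internet Explorer 11 user agent.
IE11_UA = "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
# Example Firefox user agent (illustrative; the exact version does not matter).
FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

print(triggers_ie_alert(IE11_UA))     # True
print(triggers_ie_alert(FIREFOX_UA))  # False
```

Only the Internet Explorer string contains "Trident/", so a Firefox or Chrome session never reaches the alert.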

CodePudding user response：

Your task is not trivial. Here is one possible solution:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time as t
import pandas as pd


chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
actions = ActionChains(browser)

url = 'https://iltacon2022.expofp.com/'
browser.get(url) 
c_list = []
parent_el = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//div[@data-event-id="iltacon2022"]/div')))
parent_el_shadow_root = parent_el.shadow_root 
t.sleep(5)
# companies_div = parent_el_shadow_root.find_element(By.CSS_SELECTOR, 'div[]')  # 'div[]' is not a valid CSS selector; the result is unused below
while True:
    try:
        companies = parent_el_shadow_root.find_elements(By.CSS_SELECTOR, "a[class = 'exhibitor-row list-row  ']")
        for c in companies:
            if len(c.text) > 3:
                c_list.append((c.text.replace('\n', ': '), c.get_attribute('href')))
        print(f'we found {len(c_list)} companies')
        actions.move_to_element(companies[len(c_list)]).perform()
        print("moving to element", companies[len(c_list)].text.replace('\n', ': '))
        t.sleep(1)
        companies[len(c_list)].send_keys(Keys.PAGE_DOWN)
        print('scrolled page down')
        t.sleep(2)
    except Exception as e:
        print('all done')
        break
df = pd.DataFrame(list(set(c_list)), columns = ['Company', 'Url'])
df.to_csv('surveillance_capitalists.csv')
print(df)

It's important to use Chrome/chromedriver because of the way the shadow root is located in the code above. The setup shown is for Linux, but you can use any working Selenium/chromedriver setup on your machine; just keep the imports and the code that follows the browser/driver definition. The terminal output is quite verbose and tells you what is going on. At the end, the script prints a dataframe of companies and their respective URLs, and also saves it to disk as a CSV file. You can then scrape those URLs; just make sure you inspect every page properly and locate the shadow root and the elements inside it. Selenium documentation can be found at https://www.selenium.dev/documentation/
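
Because the loop collects the same rows again after every scroll, the list ends up with many duplicates, and list(set(...)) removes them at the cost of losing the on-page order. If the order matters, something like the following could be used instead. It is a minimal sketch using only the standard library; the helper names unique_rows and rows_to_csv and the sample rows are made up for illustration.

```python
import csv
import io


def unique_rows(rows, min_text_len=4):
    """Drop short labels and duplicate (text, href) pairs, keeping first-seen order."""
    seen = set()
    result = []
    for text, href in rows:
        if len(text) < min_text_len or (text, href) in seen:
            continue
        seen.add((text, href))
        result.append((text, href))
    return result


def rows_to_csv(rows):
    """Serialise rows to CSV text using the column names Company and Url."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Company", "Url"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample rows; real rows would come from the scraping loop.
sample = [
    ("Acme: 101", "https://example.com/acme"),
    ("Acme: 101", "https://example.com/acme"),  # duplicate from a repeated scroll pass
    ("ab", "https://example.com/short"),        # too short to be a company row
    ("Globex: 202", "https://example.com/globex"),
]
print(rows_to_csv(unique_rows(sample)))
```

The result can still be loaded into pandas afterwards if a dataframe is preferred.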

For any questions, just comment here, or ask in the Selenium chat room, which I imagine is very helpful.