Python Selenium: scrape data when the "Load More" button doesn't change the URL

I am using the following code to attempt to keep clicking a "Load More" button until all page results are shown on the website:

import time

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def start_web_driver():
    # Start Chrome in incognito mode with a fixed window size.
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--incognito")
    chrome_options.add_argument("--window-size=1920,1080")
    return webdriver.Chrome(options=chrome_options)


driver = start_web_driver()
driver.get("https://together.bunq.com/all")
time.sleep(4)

# Timeout of 10 seconds, polling every 10 seconds.
wait = WebDriverWait(driver, 10, 10)

while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[title='Load More']")))
        element.click()
        print("Loading more page results")
    except WebDriverException:
        # Raised when the button never becomes clickable or the click fails.
        print("All page results displayed")
        break

However, because clicking the button does not change the URL, no new data appears to be loaded in ChromeDriver, and the while loop breaks on the second iteration.

Answer:

Selenium is overkill for this; you only need requests. If you log the browser's network traffic, you will see that at some point the page's JavaScript makes an XHR HTTP GET request to a REST API endpoint. The response is JSON and contains all the information you're likely to want to scrape.

One of the query-string parameters for that endpoint URL is page[offset], which offsets the query results for pagination (this is what the "Load More" button drives). A value of 0 corresponds to no offset, or "start at the beginning". Increment this value to fetch later pages; a loop is the natural place to do that.
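To make the role of page[offset] concrete, here is a small sketch of what the request URL looks like for two offsets. It only builds strings and sends nothing. The helper name build_url and the offset value 20 are made up for illustration; the actual page size is not something the network log description above tells us.

```python
from urllib.parse import urlencode

# Endpoint as observed in the network log.
BASE_URL = "https://together.bunq.com/api/discussions"


def build_url(offset):
    """Return the endpoint URL with the query string for a given offset.

    build_url is a hypothetical helper, used here only to show the encoding.
    """
    query = urlencode({
        "include": "user,lastPostedUser,tags,firstPost",
        "page[offset]": offset,
    })
    return f"{BASE_URL}?{query}"


print(build_url(0))   # first page: no offset
print(build_url(20))  # a later page; 20 is an arbitrary example value
```

The square brackets and commas are percent-encoded in the printed URLs. When you pass a params dict to requests, it takes care of this encoding for you.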

Simply imitate that XHR HTTP GET request - copy the API endpoint URL and query-string parameters and request headers, then parse the JSON response:

import sys

import requests


def get_discussions():
    # Endpoint and parameters copied from the XHR request seen in the network log.
    url = "https://together.bunq.com/api/discussions"

    params = {
        "include": "user,lastPostedUser,tags,firstPost",
        "page[offset]": 0,  # 0 means no offset: start at the first result
    }

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()  # fail loudly on HTTP errors

    # Each entry in "data" describes one discussion.
    yield from response.json()["data"]


def main():
    for discussion in get_discussions():
        print(discussion["attributes"]["title"])
    return 0


if __name__ == "__main__":
    sys.exit(main())

Output:

⚡️What’s new in App Update 18.8.0
Local Currencies Accounts Fees
Local Currencies
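
The code above requests only the first page (offset 0). To collect everything the "Load More" button would reveal, you could wrap the request in a loop along the following lines. This is a sketch resting on two assumptions that you should verify against the real responses: that the offset counts items, so it advances by the number of results returned, and that an empty "data" list means there are no more pages. The names fetch_page, iter_discussions and max_pages are mine, not part of the API.

```python
API_URL = "https://together.bunq.com/api/discussions"


def fetch_page(offset):
    """Fetch one page of discussions starting at the given offset."""
    # Imported here so the paging logic below can be tested without the network.
    import requests

    params = {
        "include": "user,lastPostedUser,tags,firstPost",
        "page[offset]": offset,
    }
    headers = {"user-agent": "Mozilla/5.0"}
    response = requests.get(API_URL, params=params, headers=headers, timeout=10)
    response.raise_for_status()
    return response.json()["data"]


def iter_discussions(fetch=fetch_page, max_pages=None):
    """Yield discussions page by page until an empty page comes back.

    Assumption: an empty "data" list marks the end of the results.
    max_pages is an optional safety cap on the number of requests.
    """
    offset = 0
    pages = 0
    while max_pages is None or pages < max_pages:
        page = fetch(offset)
        if not page:
            break
        yield from page
        # Assumption: the next offset is the number of items seen so far.
        offset += len(page)
        pages += 1


if __name__ == "__main__":
    # Cap at 3 pages while experimenting, to avoid hammering the server.
    for discussion in iter_discussions(max_pages=3):
        print(discussion["attributes"]["title"])
```

Passing the fetch function in as a parameter lets you try the loop with canned data before pointing it at the live site. It is also polite to add a short time.sleep between requests if you drop the max_pages cap.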