Python Selenium: scraping data when a "Load More" button doesn't change the URL


I am using the following code to attempt to keep clicking a "Load More" button until all page results are shown on the website:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def startWebDriver():
    global driver
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--incognito")
    chrome_options.add_argument("--window-size=1920x1080")
    driver = webdriver.Chrome(options=chrome_options)
    
startWebDriver()
driver.get("https://together.bunq.com/all")
time.sleep(4)

from selenium.common.exceptions import TimeoutException

while True:
    try:
        wait = WebDriverWait(driver, 10)
        element = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[title='Load More']")))
        element.click()
        print("Loading more page results")
    except TimeoutException:
        print("All page results displayed")
        break

However, since clicking the button does not change the URL, no new data appears to load in ChromeDriver, and the while loop breaks on the second iteration.

CodePudding user response:

Selenium is overkill for this. You only need requests. Logging one's network traffic reveals that at some point JavaScript makes an XHR HTTP GET request to a REST API endpoint, the response of which is JSON and contains all the information you're likely to want to scrape.

One of the query-string parameters for that endpoint URL is page[offset], which offsets the query results for pagination (this is what the "Load More" button does under the hood). A value of 0 means no offset, i.e. "start at the beginning". Increment this value to page through the results; doing so in a loop is the natural approach.

Simply imitate that XHR HTTP GET request - copy the API endpoint URL and query-string parameters and request headers, then parse the JSON response:

def get_discussions():

    import requests

    url = "https://together.bunq.com/api/discussions"

    params = {
        "include": "user,lastPostedUser,tags,firstPost",
        "page[offset]": 0
    }

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    yield from response.json()["data"]


def main():
    for discussion in get_discussions():
        print(discussion["attributes"]["title"])
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

⚡️What’s new in App Update 18.8.0
Local Currencies Accounts Fees
Local Currencies            
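The snippet above only fetches the first page (offset 0). To collect everything, the offset can be advanced in a loop as the answer suggests. Here is a minimal sketch, where the page size of 20 and the "empty data list means done" stop condition are assumptions you should verify against the real XHR in your browser's network tab:

```python
API_URL = "https://together.bunq.com/api/discussions"
PAGE_SIZE = 20  # assumed page size; check the actual API response


def fetch_page(offset):
    """Fetch one page of discussions from the REST endpoint."""
    import requests

    params = {
        "include": "user,lastPostedUser,tags,firstPost",
        "page[offset]": offset,
    }
    headers = {"user-agent": "Mozilla/5.0"}
    response = requests.get(API_URL, params=params, headers=headers)
    response.raise_for_status()
    return response.json()["data"]


def iter_all_discussions(fetch=fetch_page, page_size=PAGE_SIZE):
    """Yield every discussion, advancing page[offset] until a page
    comes back empty."""
    offset = 0
    while True:
        page = fetch(offset)
        if not page:  # assumed termination signal: empty "data" list
            break
        yield from page
        offset += page_size
```

Passing the fetch function in as a parameter also makes the pagination logic easy to exercise with canned pages, without hitting the network.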