how to scrape a website with pages that dont affect the url

Time:10-02

What I'd like to do is scrape a Clash of Clans player's profile page on clashofstats.com, for instance: https://www.clashofstats.com/players/captain-morgan-L9YJUPY22/history/log

The goal is to approximate the last and first time played from the logged clan activity. I have already implemented the last-played part with the following:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.clashofstats.com/players/captain-morgan-L9YJUPY22/history/log')

soup = BeautifulSoup(response.content, 'html.parser')

end_dates = soup.find_all(class_="end date")

last_played = str(end_dates[0]).replace("<span class=\"end date\">", "")
last_played = last_played.replace(",", "")
last_played = last_played.replace("</span>", "")

print(f"Last  time played| {last_played}")

Output: Last time played| Sep 30 2022 (weird format is to match the rest of the code)

Now back to my question: the problem is the first time played. clashofstats has multiple pages of logged clans, but when I go to the last page (where the first date is), neither the URL nor the page source changes. I can only see the changes in the browser's dev tools. How can I navigate to that last page, preferably using BeautifulSoup, and get the date, if that is even possible?

CodePudding user response:

Since the 'next' and page buttons are not hyperlinks and the site doesn't seem to load the data via any easy-to-find API, I expect it would take a horribly convoluted process of requesting and parsing scripts to retrieve this one date from the last page.

Instead, you could use selenium to click the last page button [(2) in this case] and then get the last date. (If you haven't ever used selenium before, I found this to be a very helpful starting point.)

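from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

chromeDriver_path = 'chromedriver.exe'
# I just copied the exe file to the same folder as this py file
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(chromeDriver_path))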

tag = "L9YJUPY22"  # your player tag without '#'

driver.get('https://www.clashofstats.com/players/' + tag + '/history/log')

start_dates = driver.find_elements(By.CSS_SELECTOR, 'span.start.date')
end_dates = driver.find_elements(By.CSS_SELECTOR, 'span.end.date')
page_btns = driver.find_elements(By.CSS_SELECTOR, 'button.v-pagination__item')

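try:
    # the italic list item holds the current clan status text
    clan = driver.find_element(
        By.CSS_SELECTOR, 'div.v-list-item__title.text--secondary.font-italic'
    ).get_attribute('innerText')
except Exception:  # element missing: the player is shown with a clan
    clan = "something else"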

if start_dates[0].get_attribute('innerText') == end_dates[0].get_attribute('innerText') or clan == "Not in any Clans":
    # basically the inactivity time can't really be seen from just the clan, so it only works on
    # players that don't have a clan. this is sometimes displayed weird by the website, thus this logic
    last_played = start_dates[1].get_attribute('innerText')
    # only start date is available when turning inactive
else:
    last_played = end_dates[0].get_attribute('innerText')  # in a normal case this is "today"

# see if there are more pages
if len(page_btns) > 0 and len(end_dates) > 0:
    page_btns[-1].click()  # click the last page_btn
    start_dates = driver.find_elements(By.CSS_SELECTOR, 'span.start.date')  # update start dates

first_played = start_dates[-1].get_attribute('innerText')  # get last start date

print(f"First time played| {first_played}")
print(f"Last  time played| {last_played}")

driver.quit()  # closes the window and ends the driver session; otherwise the browser stays open

Output:

First time played| Sep 8, 2020
Last  time played| Oct 1, 2022
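
Note that these values still contain the comma, while the question prints dates as 'Sep 30 2022'. Below is a minimal sketch for normalising and comparing them, assuming the site keeps the 'Mon D, YYYY' format seen in the output above. `parse_log_date` and `format_log_date` are hypothetical helper names, not part of Selenium or the site.

```python
from datetime import date, datetime


def parse_log_date(text: str) -> date:
    """Parse 'Sep 30, 2022' or 'Sep 30 2022' into a date object."""
    # Drop the comma and collapse any extra whitespace before parsing.
    cleaned = " ".join(text.replace(",", " ").split())
    return datetime.strptime(cleaned, "%b %d %Y").date()


def format_log_date(d: date) -> str:
    """Format a date the way the question prints it, e.g. 'Sep 8 2020'."""
    return f"{d.strftime('%b')} {d.day} {d.year}"


first_played = parse_log_date("Sep 8, 2020")
last_played = parse_log_date("Oct 1, 2022")
print(f"First time played| {format_log_date(first_played)}")
print(f"Last  time played| {format_log_date(last_played)}")
print(f"Days between     | {(last_played - first_played).days}")
```

Working with date objects instead of strings also makes it possible to compare or sort the entries if more than the first and last one are needed.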

This should work standalone for any number of pages.
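
One caveat: if the table is re-rendered asynchronously after the click (an assumption, not something verified for this site), reading the start dates immediately may still return the first page. Selenium ships `WebDriverWait` for this kind of situation. The sketch below is a dependency-free equivalent that polls until the texts differ from what was read before the click. `wait_for_change` is a hypothetical helper, and the commented usage assumes the `driver`, `By` and `page_btns` names from the code above.

```python
import time
from typing import Callable, List


def wait_for_change(read_texts: Callable[[], List[str]],
                    before: List[str],
                    timeout: float = 10.0,
                    poll: float = 0.2) -> List[str]:
    """Call read_texts() repeatedly until it returns a non-empty list
    that differs from `before`, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        current = read_texts()
        if current and current != before:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("page content did not change in time")
        time.sleep(poll)


# Hypothetical usage with the Selenium code above:
# read = lambda: [e.get_attribute('innerText')
#                 for e in driver.find_elements(By.CSS_SELECTOR, 'span.start.date')]
# before = read()
# page_btns[-1].click()
# first_played = wait_for_change(read, before)[-1]
```

This only makes sense when the click actually moves to a different page. If the last page showed exactly the same start dates as the first one, the helper would time out.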
