Code scrapes first webpage twice, but then scrapes the next six as it's meant to


I'm trying to scrape football scores from 8 pages online. For some reason my code scrapes the results from the first page twice, then goes on to scrape the next six pages as it should, but leaves out the final page.

Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait

import time
import requests
import numpy as np

chrome_options = Options()
chrome_options.add_argument('headless')

driver = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(driver, 10)

scores = []

for i in range(1,9,1):
    url = 'https://www.oddsportal.com/soccer/england/premier-league-2020-2021/results/#/page/' + str(i) + '/'
    time.sleep(5)
    driver.get(url)
    
    soup = BeautifulSoup(driver.page_source, 'lxml')
    main_table = soup.find('table', class_ ='table-main')
    rows_of_interest = main_table.find_all('tr', class_ = ['odd deactivate', 'deactivate'])

    for row in rows_of_interest:
        score = row.find('td', class_ = 'center bold table-odds table-score').text
        scores.append(score)

Help would be much appreciated

EDIT:

I fixed it by shifting the loop range up by 1:

for i in range(2,10,1):

I still have no idea why this works, because the page numbers are 1-8.

CodePudding user response:

You should put the delay between driver.get(url) and soup = BeautifulSoup(driver.page_source, 'lxml') so the new page has time to load.
Without it, the first iteration reads the first page correctly, because the initial driver.get(url) blocks until that page has finished loading. On the later iterations only the part of the URL after # changes, so driver.get(url) returns before the new results are rendered, and in the second iteration you parse the content of the first page again.
With time.sleep(5) in the wrong place, each iteration sleeps while the page requested in the previous iteration finishes loading, then parses that page: for example, iteration 3 sleeps (letting page 2 finish loading), requests page 3, and immediately parses what is still page 2. That one-iteration lag is why the remaining pages are still scraped, but the last page never is.
With the delay in the correct place it works correctly:

for i in range(1,9,1):
    url = 'https://www.oddsportal.com/soccer/england/premier-league-2020-2021/results/#/page/' + str(i) + '/'
    driver.get(url)
    time.sleep(5)
    
    soup = BeautifulSoup(driver.page_source, 'lxml')
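
An explicit wait is usually more robust than a fixed sleep here. The sketch below reuses the wait = WebDriverWait(driver, 10) object already created in the question: it waits for the results table to be present, and on later pages first waits for the previous table element to go stale, since only the URL fragment changes and the new results are swapped in by JavaScript. The staleness check is an assumption about how the site replaces the table; if the site updates the element in place instead, you would need a different condition.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

old_table = None
for i in range(1, 9):
    url = 'https://www.oddsportal.com/soccer/england/premier-league-2020-2021/results/#/page/' + str(i) + '/'
    driver.get(url)

    if old_table is not None:
        # Only the URL fragment changed, so the previous page's table may still
        # be in the DOM; wait for it to be replaced (assumes the site swaps the
        # element out rather than mutating it in place).
        wait.until(EC.staleness_of(old_table))

    # Wait until the results table for the requested page is present.
    old_table = wait.until(
        EC.presence_of_element_located((By.CLASS_NAME, 'table-main'))
    )

    soup = BeautifulSoup(driver.page_source, 'lxml')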