I'm trying to scrape football scores from 8 pages online. For some reason my code scrapes the results from the first page twice, then scrapes the next 6 pages as it should, and leaves out the final page.
Here is my code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
import time
import requests
import numpy as np

chrome_options = Options()
chrome_options.add_argument('headless')
driver = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(driver, 10)

scores = []
for i in range(1, 9, 1):
    url = 'https://www.oddsportal.com/soccer/england/premier-league-2020-2021/results/#/page/' + str(i) + '/'
    time.sleep(5)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    main_table = soup.find('table', class_='table-main')
    rows_of_interest = main_table.find_all('tr', class_=['odd deactivate', 'deactivate'])
    for row in rows_of_interest:
        score = row.find('td', class_='center bold table-odds table-score').text
        scores.append(score)
Help would be much appreciated.
EDIT:
I fixed it by shifting the loop range up by 1:

for i in range(2, 10, 1):

I still have no idea why this works, since the page numbers are 1-8.
CodePudding user response:
You should put the delay between driver.get(url) and soup = BeautifulSoup(driver.page_source, 'lxml') to let the new page load.
Without it, the first iteration reads the first page correctly, because driver.get() waits for the initial page load to finish before returning. From the second iteration on, however, only the URL fragment (#/page/i/) changes, so driver.get() returns immediately while JavaScript is still swapping in the new results, and you scrape the previous page's content again.
With time.sleep(5) in the wrong location (before driver.get(url)), every iteration therefore scrapes the page requested in the previous iteration: the first page twice, then pages 2-7, and never page 8. That is also why shifting the range to range(2, 10, 1) appears to work: requesting pages 2-9 with this one-iteration lag yields the content of pages 1-8.
With the delay in the correct place it will work correctly:

for i in range(1, 9, 1):
    url = 'https://www.oddsportal.com/soccer/england/premier-league-2020-2021/results/#/page/' + str(i) + '/'
    driver.get(url)
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'lxml')
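A fixed time.sleep(5) works, but it wastes time and can still be too short on a slow connection. Since your script already creates wait = WebDriverWait(driver, 10), you could instead wait explicitly until the old content has been replaced. This is only a sketch, not tested against the live site: it assumes the row class from your question ('deactivate') and that the site detaches the old rows from the DOM when it renders a new page.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

for i in range(1, 9, 1):
    url = 'https://www.oddsportal.com/soccer/england/premier-league-2020-2021/results/#/page/' + str(i) + '/'
    if i > 1:
        # remember a row from the page that is currently displayed
        old_row = driver.find_element(By.CSS_SELECTOR, 'table.table-main tr.deactivate')
    driver.get(url)
    if i > 1:
        # after the fragment-only navigation the content is swapped in by
        # JavaScript, so wait (up to 10 s) until the remembered row has been
        # detached from the DOM before parsing the new page
        wait.until(EC.staleness_of(old_row))
    soup = BeautifulSoup(driver.page_source, 'lxml')

On the first iteration driver.get() performs a full page load, so no extra wait is needed there; on later iterations the staleness check returns as soon as the content starts changing, instead of always paying the full 5 seconds per page.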