Multiple scraping: problem in the code. What am I doing wrong?

Time:11-07

I am trying to scrape multiple elements with Selenium (for personal study only, not for profit). Each group of scraped elements should form one row to insert into a database. Until now I have only scraped single elements, never several at once, so there is some problem in my code.

I would like to create this row for each round (round 1, round 2, etc.) of the championship: Round, Date, Team_Home, Team_Away, Result_Home, Result_Away. To give you a better idea, there will be 8 rows for each championship round, and there are 26 rounds in total. I'm not getting any errors, but the only output is the >>> prompt, with no text.

P.S.: This question and this code are only for personal study, not for commercial or profit-making purposes.

I would like to get, for example, this:

#SWEDEN ALLSVENSKAN
#Round, Date, Team_Home, Team_Away, Result_Home, Result_Away

Round 1, 11/31/2021 20:45, AIK Stockholm, Malmo, 2, 1
Round 1, 11/31/2021 20:45, Elfsborg, Gothenburg, 2, 3
...and the rest of the other matches of the 1st round

Round 2, 06/12/2021 20:45, Gothenburg, AIK Stockholm, 0, 1
Round 2, 06/12/2021 20:45, Malmo, Elfsborg, 1, 1
...and the rest of the other matches of the 2nd round

Round 3, etc.

Python code for scraping:

Values_Allsvenskan = []

#SCRAPING
driver.get("link")
driver.implicitly_wait(12)
driver.minimize_window()

for Allsvenskan in multiple_scraping:

    try:
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
    except:
        pass

    multiple_scraping = round, date, team_home, team_away, score_home, score_away

    #row/record
    round = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__round event__round--static']")
    date = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__time']")
    team_home = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__participant event__participant--home']")            
    team_away = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__participant event__participant--away']")
    score_home = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__score event__score--home']")
    score_away = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__score event__score--away']")   


    Allsvenskan_text = round.text, date.text, team_home.text, team_away.text, score_home.text, score_away.text
    Values_Allsvenskan.append(tuple([Allsvenskan_text]))
    print(Allsvenskan_text)
driver.close


    #INSERT IN DATABASE
    con = sqlite3.connect('/database.db')
    cursor = con.cursor()
    sqlite_insert_query_Allsvenskan = 'INSERT INTO All_Score (round, date, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?);'
    cursor.executemany(sqlite_insert_query_Allsvenskan, Values_Allsvenskan)
    con.commit()  

Based on my Python code, can you show me how to fix it? Thanks

UPDATE FOR INSERT IN DATABASE

#INSERT IN DATABASE
con = sqlite3.connect('database.db')
cursor = con.cursor()
sqlite_insert_query_Allsvenskan = 'INSERT INTO All_Score(current_round, date, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?);'
cursor.executemany(sqlite_insert_query_Allsvenskan, results)
con.commit()  

FINAL UPDATE ON THE CODE LOGIC, AFTER THE FINAL ANSWER: I have only added comments to explain the steps. If I have missed a comment or something needs to be added, go ahead. I want to make sure I understand the logic of the code.

# Find the rows whose class starts with event__round or event__match
all_rows = driver.find_elements(By.CSS_SELECTOR, "div[class^='event__round'],div[class^='event__match']")

# Initialize an empty list for the result rows
results = []

# Default value of the round before the for loop
current_round = '?'

# Read the class attribute of each row. Is this used to recognize the rows with the round?
for row in all_rows:
    classes = row.get_attribute('class')

    # If the row is a round row, store its round; otherwise it is a match row,
    # so use find_element to get the rest of the data to scrape
    if.........
    else.....

CodePudding user response:

You use find_elements to get lists with all rounds, all dates, all team_home, all team_away, etc., so the values are in separate lists. You should use zip() to group them into rows like [single round, single date, single team_home, ...]:

results = []

for row in zip(date, team_home, team_away, score_home, score_away):
    row = [item.text for item in row]
    print(row)
    results.append(row)

I skipped round because it causes more problems and needs completely different code.

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
#driver.minimize_window()

wait = WebDriverWait(driver, 10)

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
except Exception as ex:
    print('EX:', ex)

round = driver.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
date = driver.find_elements(By.CSS_SELECTOR, "[class^='event__time']")  # date and time are a single element on diretta.it
team_home = driver.find_elements(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")            
team_away = driver.find_elements(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
score_home = driver.find_elements(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
score_away = driver.find_elements(By.CSS_SELECTOR, "[class^='event__score event__score--away']")   

results = []

for row in zip(date, team_home, team_away, score_home, score_away):
    row = [item.text for item in row]
    print(row)
    results.append(row)

Result:

['01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1']
['01.11. 19:00', 'Halmstad', 'AIK Stockholm', '1', '0']
['01.11. 19:00', 'Mjallby', 'Hammarby', '2', '0']
['31.10. 17:30', 'Örebro', 'Djurgarden', '0', '1']
['31.10. 15:00', 'Norrkoping', 'Elfsborg', '3', '2']
['30.10. 17:30', 'Hacken', 'Kalmar', '1', '4']
['30.10. 15:00', 'Sirius', 'Malmo FF', '2', '3']
['30.10. 15:00', 'Varbergs', 'Östersunds', '3', '0']
['28.10. 19:00', 'Degerfors', 'Elfsborg', '1', '2']
['28.10. 19:00', 'Göteborg', 'Djurgarden', '3', '0']
['28.10. 19:00', 'Halmstad', 'Örebro', '1', '1']
['28.10. 19:00', 'Norrkoping', 'Mjallby', '2', '2']
['27.10. 19:00', 'Kalmar', 'Varbergs', '2', '2']
['27.10. 19:00', 'Malmo FF', 'AIK Stockholm', '1', '0']
['27.10. 19:00', 'Östersunds', 'Hacken', '1', '1']
['27.10. 19:00', 'Sirius', 'Hammarby', '0', '1']
['25.10. 19:00', 'Örebro', 'Degerfors', '1', '2']
['24.10. 17:30', 'AIK Stockholm', 'Norrkoping', '1', '0']
...

But this method can sometimes cause problems: if some row has an empty field, the value from the next row moves into the current row, and so on. This can create wrong rows.
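A minimal sketch of that failure mode, using plain Python lists instead of Selenium elements. The values are sample data, and the missing score is an assumed scenario (for example, a match without a result), not something taken from the real page:

```python
# Sample values standing in for the .text of the elements from find_elements.
# Assumption: the second match has no score elements on the page.
team_home = ['Degerfors', 'Halmstad', 'Mjallby']
team_away = ['Göteborg', 'AIK Stockholm', 'Hammarby']
score_home = ['0', '2']   # scores of the first and third match only
score_away = ['1', '0']

# zip() pairs items by position and stops at the shortest list
rows = [list(row) for row in zip(team_home, team_away, score_home, score_away)]

for row in rows:
    print(row)
```

Here the second row pairs Halmstad with the score that belongs to Mjallby, and the Mjallby row is dropped silently because zip() stops at the shortest list.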

It is better to find all rows (div, or tr in a table), then use a for-loop to work with every row separately, calling row.find_elements instead of driver.find_elements. This should also resolve the problem with round, whose value has to be read once and then repeated in the following rows.

I search for rows with event__round or event__match and then check which classes each row has. If it has event__round, I get the round. If it has event__match, I use find_element (without the s at the end) to get a single date, single team_home, single team_away, etc. (because a single row holds only single values) and combine them with current_round to create the row.

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
#driver.minimize_window()

wait = WebDriverWait(driver, 10)

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
except Exception as ex:
    print('EX:', ex)

all_rows = driver.find_elements(By.CSS_SELECTOR, "div[class^='event__round'],div[class^='event__match']")

results = []

current_round = '?'

for row in all_rows:
    classes = row.get_attribute('class')
    #print(classes)
    
    if 'event__round' in classes:
        #round = row.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
        #current_round = row.text  # full text `Round 20`
        current_round = row.text.split(" ")[-1]  # only `20` without `Round`
    else:
        datetime = row.find_element(By.CSS_SELECTOR, "[class^='event__time']")
        
        date, time = datetime.text.split(" ")
        date = date.rstrip('.')  # right-strip to remove `.` at the end of date
        
        team_home = row.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")            
        team_away = row.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
        score_home = row.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
        score_away = row.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--away']")   

        # old version
        #row = [current_round, datetime.text, team_home.text, team_away.text, score_home.text, score_away.text]
    
        row = [current_round, date, time, team_home.text, team_away.text, score_home.text, score_away.text]
        results.append(row)
        print(row)

# --- database ---

import sqlite3

con = sqlite3.connect('database.db')
cursor = con.cursor()

query = 'DROP TABLE IF EXISTS All_Score;'
cursor.execute(query)

# old version - with only `date`
#query = 'CREATE TABLE IF NOT EXISTS All_Score(current_round, date, team_home, team_away, score_home, score_away);'
# new version - with `date` and `time`
query = 'CREATE TABLE IF NOT EXISTS All_Score(current_round, date, time, team_home, team_away, score_home, score_away);'
cursor.execute(query)

# old version - with only `date`
#query = 'INSERT INTO All_Score(current_round, date, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?);'
# new version - with `date` and `time`
query = 'INSERT INTO All_Score(current_round, date, time, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?, ?);'
cursor.executemany(query, results)

con.commit()   

Result (from the old version, with the full round text and the date and time in one field):

['Giornata 26', '01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1']
['Giornata 26', '01.11. 19:00', 'Halmstad', 'AIK Stockholm', '1', '0']
['Giornata 26', '01.11. 19:00', 'Mjallby', 'Hammarby', '2', '0']
['Giornata 26', '31.10. 17:30', 'Örebro', 'Djurgarden', '0', '1']
['Giornata 26', '31.10. 15:00', 'Norrkoping', 'Elfsborg', '3', '2']
['Giornata 26', '30.10. 17:30', 'Hacken', 'Kalmar', '1', '4']
['Giornata 26', '30.10. 15:00', 'Sirius', 'Malmo FF', '2', '3']
['Giornata 26', '30.10. 15:00', 'Varbergs', 'Östersunds', '3', '0']

['Giornata 25', '28.10. 19:00', 'Degerfors', 'Elfsborg', '1', '2']
['Giornata 25', '28.10. 19:00', 'Göteborg', 'Djurgarden', '3', '0']
['Giornata 25', '28.10. 19:00', 'Halmstad', 'Örebro', '1', '1']
# ...
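The row-by-row logic can also be tried without a browser. The following is only a sketch under assumptions: `page_rows` is a hypothetical stand-in for the Selenium row elements (a class string plus the visible text), the values are sample data, and an in-memory SQLite database replaces database.db so nothing is written to disk.

```python
import sqlite3

# Hypothetical stand-in for the rows returned by find_elements:
# (class attribute, text of the round row OR text fields of the match row)
page_rows = [
    ('event__round event__round--static', 'Giornata 26'),
    ('event__match', ('01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1')),
    ('event__match', ('31.10. 17:30', 'Örebro', 'Djurgarden', '0', '1')),
    ('event__round event__round--static', 'Giornata 25'),
    ('event__match', ('28.10. 19:00', 'Degerfors', 'Elfsborg', '1', '2')),
]

results = []
current_round = '?'  # default until the first round row is seen

for classes, content in page_rows:
    if 'event__round' in classes:
        # round row: remember the number for the match rows that follow
        current_round = content.split(' ')[-1]
    else:
        # match row: split date and time, then build one database row
        datetime_text, team_home, team_away, score_home, score_away = content
        date, time = datetime_text.split(' ')
        date = date.rstrip('.')
        results.append((current_round, date, time, team_home, team_away,
                        score_home, score_away))

# in-memory database (assumption: used here only to check the inserted rows)
con = sqlite3.connect(':memory:')
cursor = con.cursor()
cursor.execute('CREATE TABLE All_Score(current_round, date, time, team_home, '
               'team_away, score_home, score_away);')
cursor.executemany('INSERT INTO All_Score VALUES (?, ?, ?, ?, ?, ?, ?);', results)
con.commit()

query = 'SELECT current_round, date, time, team_home FROM All_Score ORDER BY rowid;'
stored = cursor.execute(query).fetchall()
for record in stored:
    print(record)
con.close()
```

Each element of results has exactly seven values, one per `?` placeholder, which is the shape executemany expects.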