I'm new to Python and Selenium and have run into the following problem:
I want to scrape certain columns from a table. The table shows the German Bundesliga standings, and two buttons above it switch the season and the matchday. I want to scrape the data only for the 2021/22 season, but for all matchdays.
I managed to get all rows of the table for one matchday, but I don't know how to "iterate" over all matchdays.
I greatly appreciate any help on this!
Below is my latest attempt. The result list only contains the first 5 rows per matchday. I don't understand why, and I'm running out of ideas for how to fix my code. I'm not sure whether XPath is a good choice here, but I couldn't figure out another way to find the correct column entries (maybe it's better to select by class instead of XPath).
What I expect is a list containing all rows of the first table on the page for all 34 matchdays (18 × 34 = 612 rows).
from selenium import webdriver
from selenium.webdriver.common.by import By

# Absolute XPath to the rows of the first table on the page (shared by all columns below)
rows_xpath = "/html[@class=' kick__nodark']/body/div[@id='kick__page-container']/div[@id='kick__page']/div[@class='kick__area--main']/div[@class='kick__card '][1]/div[@class='kick__data-grid']/div[@class='kick__data-grid__main ']/div[@class='kick__site-padding']/div[@id='1']/div[@class='kick__module-margin']/table[@class='kick__table kick__table--ranking kick__table--alternate kick__table--resptabelle']/tbody/tr"

element_list = []
for matchday in range(1, 35, 1):
    url = 'https://www.kicker.de/bundesliga/tabelle/2021-22/' + str(matchday)
    driver = webdriver.Chrome()
    driver.get(url)
    # One list of cells per column; note that "matchday" is reused here for the games-played column
    position = driver.find_elements(By.XPATH, rows_xpath + "/td[@class='kick__table--ranking__rank kick__respt-m-o-1 kick__respt-m-w-25']")
    team = driver.find_elements(By.XPATH, rows_xpath + "/td[@class='kick__table--ranking__teamname kick__table--ranking__index kick__t__a__l kick__respt-m-o-4 kick__respt-m-w-120 kick__t__a__l']/a/span[@class='kick__table--show-desktop']")
    matchday = driver.find_elements(By.XPATH, rows_xpath + "/td[@class='kick__table--ranking__number'][1]")
    goals = driver.find_elements(By.XPATH, rows_xpath + "/td[@class='kick__table--ranking__number'][2]")
    points = driver.find_elements(By.XPATH, rows_xpath + "/td[@class='kick__table--ranking__master kick__respt-m-o-5']")
    for i in range(len(team)):
        element_list.append([position[i].text, team[i].text, matchday[i].text, goals[i].text, points[i].text])
    # closing the driver
    driver.close()
CodePudding user response:
Without Selenium, I modified your code as follows to get the results you're looking for:
import pandas as pd  # pd.read_html also needs an HTML parser such as lxml installed

final_df = None  # will contain the dataframe with all the data obtained

# Loop that makes one request per iteration - I changed the "35" to "5"
# to generate a shortened version of the desired results
for matchday in range(1, 5, 1):
    # Get the "table" HTML elements that exist on the page:
    scrapper = pd.read_html('https://www.kicker.de/bundesliga/tabelle/2021-22/' + str(matchday))
    # In my tests, the 0 position is the one that contains the desired data:
    table = scrapper[0]
    # Remove certain columns not needed:
    table = table.drop(table.columns[[0, 1, 2]], axis=1)
    # Build the final dataframe - by initializing the variable
    # or appending the "table" to the final dataframe
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead):
    if final_df is None:
        final_df = table
    else:
        final_df = pd.concat([final_df, table], ignore_index=True)

# Display the final dataframe (display() exists in Jupyter/IPython; use print() elsewhere):
display(final_df)
Results - shortened:
index | Team | Sp. | ss-u-n | U | N | Tore | Diff. | Punkte |
---|---|---|---|---|---|---|---|---|
0 | Wolfsburg VfL Wolfsburg | 4 | 4-0-0 4 | 0 | 0 | 6:1 | 5 | 12 |
1 | Bayern (M) Bayern München (M) | 4 | 3-1-0 3 | 1 | 0 | 13:4 | 9 | 10 |
2 | Dortmund (P) Borussia Dortmund (P) | 4 | 3-0-1 3 | 0 | 1 | 13:9 | 4 | 9 |
3 | Mainz 1. FSV Mainz 05 | 4 | 3-0-1 3 | 0 | 1 | 6:2 | 4 | 9 |
4 | Freiburg SC Freiburg | 4 | 2-2-0 2 | 2 | 0 | 6:4 | 2 | 8 |
5 | Leverkusen Bayer 04 Leverkusen | 4 | 2-1-1 2 | 1 | 1 | 12:6 | 6 | 7 |
6 | Köln 1. FC Köln | 4 | 2-1-1 2 | 1 | 1 | 8:6 | 2 | 7 |
7 | Union 1. FC Union Berlin | 4 | 1-3-0 1 | 3 | 0 | 5:4 | 1 | 6 |
8 | Hoffenheim TSG Hoffenheim | 4 | 1-1-2 1 | 1 | 2 | 8:7 | 1 | 4 |
9 | Stuttgart VfB Stuttgart | 4 | 1-1-2 1 | 1 | 2 | 8:9 | -1 | 4 |
10 | Gladbach Bor. Mönchengladbach | 4 | 1-1-2 1 | 1 | 2 | 5:8 | -3 | 4 |
11 | Leipzig RB Leipzig | 4 | 1-0-3 1 | 0 | 3 | 5:6 | -1 | 3 |
12 | Bochum (N) VfL Bochum (N) | 4 | 1-0-3 1 | 0 | 3 | 4:6 | -2 | 3 |
13 | Bielefeld Arminia Bielefeld | 4 | 0-3-1 0 | 3 | 1 | 3:5 | -2 | 3 |
14 | Frankfurt Eintracht Frankfurt | 4 | 0-3-1 0 | 3 | 1 | 4:7 | -3 | 3 |
15 | Hertha Hertha BSC | 4 | 1-0-3 1 | 0 | 3 | 5:11 | -6 | 3 |
16 | Augsburg FC Augsburg | 4 | 0-2-2 0 | 2 | 2 | 1:8 | -7 | 2 |
17 | Fürth (N) SpVgg Greuther Fürth (N) | 4 | 0-1-3 0 | 1 | 3 | 2:11 | -9 | 1 |
18 | Stuttgart VfB Stuttgart | 1 | 1-0-0 1 | 0 | 0 | 5:1 | 4 | 3 |
19 | Hoffenheim TSG Hoffenheim | 1 | 1-0-0 1 | 0 | 0 | 4:0 | 4 | 3 |
20 | Dortmund (P) Borussia Dortmund (P) | 1 | 1-0-0 1 | 0 | 0 | 5:2 | 3 | 3 |
21 | Köln 1. FC Köln | 1 | 1-0-0 1 | 0 | 0 | 3:1 | 2 | 3 |
22 | Mainz 1. FSV Mainz 05 | 1 | 1-0-0 1 | 0 | 0 | 1:0 | 1 | 3 |
23 | Wolfsburg VfL Wolfsburg | 1 | 1-0-0 1 | 0 | 0 | 1:0 | 1 | 3 |
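
Because the stacked result does not say which matchday a row came from, a variation on the loop above could tag each table before combining them. The following is only a sketch under assumptions: `matchday_urls` and `scrape_season` are hypothetical helper names, the added `matchday` column is not part of the page, and it assumes (as in the tests above) that the first table returned by `pd.read_html` is the league table.

```python
# URL pattern taken from the code above
BASE_URL = "https://www.kicker.de/bundesliga/tabelle"


def matchday_urls(season="2021-22", first=1, last=34):
    """Return (matchday, url) pairs for every matchday from first to last, inclusive."""
    return [(day, f"{BASE_URL}/{season}/{day}") for day in range(first, last + 1)]


def scrape_season(season="2021-22", first=1, last=34):
    """Fetch the first table of each matchday page and stack them into one DataFrame.

    Hypothetical helper: needs network access, pandas, and an HTML parser such as lxml.
    """
    import pandas as pd  # imported here so matchday_urls() can be used without pandas

    frames = []
    for day, url in matchday_urls(season, first, last):
        table = pd.read_html(url)[0]  # assumption: table 0 is the league table
        table["matchday"] = day       # extra column so rows can be told apart later
        frames.append(table)
    # Concatenate once at the end instead of growing a DataFrame inside the loop
    return pd.concat(frames, ignore_index=True)
```

Calling `scrape_season(last=4)` would correspond to the shortened run above, and `scrape_season()` would request all 34 matchdays. Collecting the frames in a list and concatenating once avoids copying the growing DataFrame on every iteration.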