How can I scrape all rows from a specific table?

Time:01-07

I'm new to Python and Selenium and having the following problem:

I want to scrape certain columns from a table. The table is a table of the German Bundesliga and there are two buttons above it to change the season and the matchday. I want to scrape the data only for the 2021/22 season, but for all matchdays.

I managed to get all rows of the table for one matchday, but I don't know how to "iterate" over all matchdays.

I greatly appreciate any help on this!

Below is my last try. The result list only gives me the first 5 rows per matchday. I don't understand why and I'm running out of ideas how to correct my code. I'm not sure if it's a good idea to use xpath here, but I couldn't figure out another way to find the correct column entries (maybe, it's better to use class instead of xpath).

What I would expect is a list containing all rows of the first table on the page for all 34 matchdays (18 x 34 rows).
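
As a quick sanity check on that expectation, here is a minimal sketch that builds the 34 matchday URLs and computes the expected number of rows. The names `BASE_URL`, `TEAMS_PER_TABLE` and `matchday_urls` are made up for illustration; only the URL pattern comes from the code below.

```python
# URL pattern taken from the question; everything else is illustrative.
BASE_URL = "https://www.kicker.de/bundesliga/tabelle/2021-22/"
MATCHDAYS = range(1, 35)   # matchdays 1..34
TEAMS_PER_TABLE = 18       # rows expected in each matchday table


def matchday_urls(base_url=BASE_URL, matchdays=MATCHDAYS):
    """Return one table URL per matchday."""
    return [f"{base_url}{day}" for day in matchdays]


urls = matchday_urls()
expected_rows = TEAMS_PER_TABLE * len(urls)
print(len(urls), expected_rows)
```

Comparing `len(element_list)` against `expected_rows` after scraping would show at once whether rows are missing.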

from selenium import webdriver
from selenium.webdriver.common.by import By

# Common XPath prefix for the rows of the first table (same locator as before, written once).
ROWS = "/html[@class=' kick__nodark']/body/div[@id='kick__page-container']/div[@id='kick__page']/div[@class='kick__area--main']/div[@class='kick__card '][1]/div[@class='kick__data-grid']/div[@class='kick__data-grid__main ']/div[@class='kick__site-padding']/div[@id='1']/div[@class='kick__module-margin']/table[@class='kick__table kick__table--ranking kick__table--alternate kick__table--resptabelle']/tbody/tr"

element_list = []
driver = webdriver.Chrome()  # start the browser once, not once per matchday

for matchday in range(1, 35):
    url = 'https://www.kicker.de/bundesliga/tabelle/2021-22/' + str(matchday)
    driver.get(url)
    position = driver.find_elements(By.XPATH, ROWS + "/td[@class='kick__table--ranking__rank kick__respt-m-o-1 kick__respt-m-w-25']")
    team = driver.find_elements(By.XPATH, ROWS + "/td[@class='kick__table--ranking__teamname kick__table--ranking__index kick__t__a__l kick__respt-m-o-4 kick__respt-m-w-120 kick__t__a__l']/a/span[@class='kick__table--show-desktop']")
    # Renamed from "matchday" so it no longer shadows the loop variable.
    matchday_col = driver.find_elements(By.XPATH, ROWS + "/td[@class='kick__table--ranking__number'][1]")
    goals = driver.find_elements(By.XPATH, ROWS + "/td[@class='kick__table--ranking__number'][2]")
    points = driver.find_elements(By.XPATH, ROWS + "/td[@class='kick__table--ranking__master kick__respt-m-o-5']")

    for i in range(len(team)):
        element_list.append([position[i].text, team[i].text, matchday_col[i].text, goals[i].text, points[i].text])

# Quit the browser after all matchdays have been scraped.
driver.quit()
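
Why only five rows come back is not established here. Separately from that, collecting each column into its own list and pairing the entries by index is fragile: if one locator matches fewer cells than another, the rows fall out of step. Reading the table row by row avoids this. The sketch below shows the idea on a small, invented HTML snippet, using the standard library's `html.parser` instead of Selenium so it runs without a browser. The sample markup, `TableRowParser` and `parse_rows` are all hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell texts
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, e.g. header rows
                self.rows.append(self._row)
            self._row = None


# Invented sample markup, loosely shaped like a ranking table.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td><a><span>Team A</span></a></td><td>4</td><td>6:1</td><td>12</td></tr>
    <tr><td>2</td><td><a><span>Team B</span></a></td><td>4</td><td>13:4</td><td>10</td></tr>
  </tbody>
</table>
"""


def parse_rows(html):
    """Return the table cells of `html` as a list of rows."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

With Selenium the same pattern would be to locate the `tr` elements once, for example with `driver.find_elements(By.CSS_SELECTOR, "table tbody tr")`, and then call `row.find_elements(By.TAG_NAME, "td")` on each row. The selector is an assumption and would need to be checked against the real page.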

CodePudding user response:

Without Selenium, I modified your code as follows to get the results you're looking for:

import pandas as pd

tables = []  # one dataframe per matchday

# One request per matchday. The range stops at 5 instead of 35
# to produce a shortened version of the desired results.
for matchday in range(1, 5):
  # Get all "table" HTML elements that exist on the page:
  scraped = pd.read_html('https://www.kicker.de/bundesliga/tabelle/2021-22/' + str(matchday))

  # In my tests, the table at position 0 is the one that contains the desired data:
  table = scraped[0]

  # Remove the first three columns, which are not needed:
  table = table.drop(table.columns[[0, 1, 2]], axis=1)
  tables.append(table)

# Build the final dataframe with all the data obtained.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0:
final_df = pd.concat(tables, ignore_index=True)

# Display the final dataframe:
print(final_df)

Results - shortened:

index Team Sp. ss-u-n U N Tore Diff. Punkte
0 Wolfsburg VfL Wolfsburg 4 4-0-0 4 0 0 6:1 5 12
1 Bayern (M) Bayern München (M) 4 3-1-0 3 1 0 13:4 9 10
2 Dortmund (P) Borussia Dortmund (P) 4 3-0-1 3 0 1 13:9 4 9
3 Mainz 1. FSV Mainz 05 4 3-0-1 3 0 1 6:2 4 9
4 Freiburg SC Freiburg 4 2-2-0 2 2 0 6:4 2 8
5 Leverkusen Bayer 04 Leverkusen 4 2-1-1 2 1 1 12:6 6 7
6 Köln 1. FC Köln 4 2-1-1 2 1 1 8:6 2 7
7 Union 1. FC Union Berlin 4 1-3-0 1 3 0 5:4 1 6
8 Hoffenheim TSG Hoffenheim 4 1-1-2 1 1 2 8:7 1 4
9 Stuttgart VfB Stuttgart 4 1-1-2 1 1 2 8:9 -1 4
10 Gladbach Bor. Mönchengladbach 4 1-1-2 1 1 2 5:8 -3 4
11 Leipzig RB Leipzig 4 1-0-3 1 0 3 5:6 -1 3
12 Bochum (N) VfL Bochum (N) 4 1-0-3 1 0 3 4:6 -2 3
13 Bielefeld Arminia Bielefeld 4 0-3-1 0 3 1 3:5 -2 3
14 Frankfurt Eintracht Frankfurt 4 0-3-1 0 3 1 4:7 -3 3
15 Hertha Hertha BSC 4 1-0-3 1 0 3 5:11 -6 3
16 Augsburg FC Augsburg 4 0-2-2 0 2 2 1:8 -7 2
17 Fürth (N) SpVgg Greuther Fürth (N) 4 0-1-3 0 1 3 2:11 -9 1
18 Stuttgart VfB Stuttgart 1 1-0-0 1 0 0 5:1 4 3
19 Hoffenheim TSG Hoffenheim 1 1-0-0 1 0 0 4:0 4 3
20 Dortmund (P) Borussia Dortmund (P) 1 1-0-0 1 0 0 5:2 3 3
21 Köln 1. FC Köln 1 1-0-0 1 0 0 3:1 2 3
22 Mainz 1. FSV Mainz 05 1 1-0-0 1 0 0 1:0 1 3
23 Wolfsburg VfL Wolfsburg 1 1-0-0 1 0 0 1:0 1 3
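
In the output above, the Tore column holds goals as a `scored:conceded` string such as `6:1`, and Diff. is the difference between the two numbers. If the goals are needed as integers, a minimal sketch could look like this. The function names are made up, and the `scored:conceded` format is assumed from the rows shown.

```python
def split_goals(tore):
    """Split a 'scored:conceded' string such as '6:1' into two integers."""
    scored, conceded = tore.split(":")
    return int(scored), int(conceded)


def goal_difference(tore):
    """Return goals scored minus goals conceded."""
    scored, conceded = split_goals(tore)
    return scored - conceded


print(split_goals("6:1"), goal_difference("6:1"))
```

With pandas, the same split could be applied to the whole column, for example through `final_df["Tore"].str.split(":", expand=True)`.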