Home > Software design >  How to scrape match data with its corresponding date?
How to scrape match data with its corresponding date?

Time:01-07

I am trying to scrape a website for Cricket club results, but the number of matches on a certain date are not fixed. Eg: Saturday 17 September 2022 has 1 match, and Saturday 10 September 2022 has 3 matches. It would've been simple if the website separated the dates in different classes or tables but it doesn't seem to be the case.

import requests
import urllib3
import pandas as pd
from html.parser import HTMLParser
from bs4 import BeautifulSoup

The URL for the website as shown below:-

#Url = Page 1 of results
url = 'https://halstead.play-cricket.com/Matches?fixture_month=13&home_or_away=both&page=1&q[category_id]=all&q[gender_id]=all&search_in=&season_id=255&seasonchange=f&selected_season_id=255&tab=Result&team_id=&utf8=✓&view_by=year'

data = requests.get(url).text
soup = BeautifulSoup(data, 'lxml')

Main Code

#Creating the table
main_lists = {'Team A':[], '':[],'Team B':[]}
entries = soup.findAll('p', class_='txt1')
list = []

for entries in entries:
    #Everything in one list
    list.append(entries.text.strip('/n'))

#Separating entries by odd and even index
l = range(len(list))
list_even = list[::2]
list_odd = list [1::2]

for list_even in list_even:
    main_lists['Team A']  = [list_even]
    main_lists['']  = ['vs']
for list_odd in list_odd:
    main_lists['Team B']  = [list_odd]

#Turn lists into dataframe
df_main = pd.DataFrame(main_lists)

#Getting result
res_list = []
x = 0
while x < df_main.shape[0]:
    res = soup.select('.fonts-gt')[x];x  = 1
    res_list.append(res.text)
    res_list = [sub.replace('  ',' ') for sub in res_list]

df_main['Result'] = res_list
df_main = df_main.reindex(columns=['Result', 'Team A', 'Team B'])

#Getting the Date
date = soup.findAll('div', class_='col-sm-12 text-center text-md-left title2 padding_top_for_mobile')

date_table = []
for date in date:
    date_table.append(date.text.strip('\n'))
    date_table2 = [sub.replace('2022\n', '2022') for sub in date_table]
df_date = pd.DataFrame(date_table2)

print(f'The length of df_main is {len(df_main)}, and the length of df_date is {len(df_date)}')

Here we can see the difference in number of rows of both data frames.

The length of df_main is 25, and the length of df_date is 12

I did try using something like:-

items = soup.find_all(class_=['row ml-large-0 mr-large-0','col-sm-12 d-md-none match-status-mobile'])
for item in items:
    print(item.text)

Which gives something like: But I still have no idea how to separate these by the date.

CodePudding user response:

Try to change your selection strategy and extract information in one go. Iterate over the matches and use find_previous() to extract the respective date under which the match is listed.

In newer code avoid old syntax findAll() instead use find_all() or select() with css selectors - For more take a minute to check docs

Example

import requests
from bs4 import BeautifulSoup

url='https://halstead.play-cricket.com/Matches?fixture_month=13&home_or_away=both&page=1&q[category_id]=all&q[gender_id]=all&search_in=&season_id=255&seasonchange=f&selected_season_id=255&tab=Result&team_id=&utf8=✓&view_by=year'
soup = BeautifulSoup(requests.get(url).text)
data = []

for e in soup.select('table'):
    d = dict(zip(['Team A','Team B'],[t.text for t in e.select('.txt1')]))
    d.update({
        'Result':e.div.text,
        'Date':e.find_previous('div',class_='title2').get_text(strip=True)
    })
    data.append(d)

pd.DataFrame(data)

Output

Team A Team B Result Date
0 Halstead CC, Essex - 1st XI Fakenham CC - 1st XI FAKENHAM CC WON BY 8 WICKETS Saturday 17 September 2022
1 Battisford & District CC - Saturday 1st XI Halstead CC, Essex - 2nd XI CANCELLED Saturday 10 September 2022
2 Maldon CC - 4th XI Halstead CC, Essex - 3rd XI CANCELLED Saturday 10 September 2022
3 Halstead CC, Essex - 1st XI Worlington CC - 1st XI HALSTEAD CC, ESSEX WON BY 23 RUNS Saturday 10 September 2022
... ... ... ... ...
21 Halstead CC, Essex - NECL 1st XI Wickham St Pauls CC - 1st XI WICKHAM ST PAULS CC WON BY 2 WICKETS Sunday 31 July 2022
22 Halstead CC, Essex - 2nd XI West Bergholt CC - Two Counties 1st XI HALSTEAD CC, ESSEX WON BY 3 RUNS Saturday 30 July 2022
23 Halstead CC, Essex - 3rd XI Abberton & District CC - 3rd XI ABBERTON & DISTRICT CC WON BY 6 WICKETS Saturday 30 July 2022
24 Coggeshall Town CC - 1st XI Halstead CC, Essex - 1st XI HALSTEAD CC, ESSEX WON BY 117 RUNS Saturday 30 July 2022

CodePudding user response:

It was difficult, but, once I change the approach to get the data, I could get the results you want.

The changes were:

  • Get the data inside the table HTML element - which is the element that has the matches (i.e. Team A, Team B and result) .
  • For the date, get the text inside the div when this said div does not have a table HTML element.

Here is the modified code:

import requests
import urllib3
import pandas as pd
from html.parser import HTMLParser
import bs4
from bs4 import BeautifulSoup

#Url = Page 1 of results
url = 'https://halstead.play-cricket.com/Matches?fixture_month=13&home_or_away=both&page=1&q[category_id]=all&q[gender_id]=all&search_in=&season_id=255&seasonchange=f&selected_season_id=255&tab=Result&team_id=&utf8=✓&view_by=year'
data = requests.get(url).text
soup = BeautifulSoup(data, 'lxml')

# Variables: 
result_array = [] # It will contain the list of dictionaries with the desired data.
team_a = "" # Value for "Team A" column.
team_b = "" # Value for "Team B" column.
result = "" # Value for "Result" column. 
temp_date = "" # Stores the date.

# Get the main div that contains the data to extract: 
div_principal = soup.find("div", class_="tab-pane fade active in")

# Loop the divs "i.e. the data of the matches and the date" found in the main container div: 
for indx, div in enumerate(div_principal):
  # Loop only the valid elements with the data to get:
  if (indx >= 5 and indx <= 77): 
    # Check if the element is a valid BeautifulSoup "Tag" element: 
    # Source: https://stackoverflow.com/a/42802393/12511801
    if type(div) is bs4.element.Tag: 
      # If no "table" is found on this div, then, it's a "date" - that's due I'm getting the "data" contained on the "table" HTML element: 
      if (div.find("table") is not None):
        # Get the values for "Team A", "Team B" and "Result": 
        team_a = div.find('table').find("p", class_="txt1").get_text("", strip=True)
        team_b = div.find('table').find("td", class_="col-xs-5 col-sm-3 col-md-3 text-center text-md-left card-table-r bgGray").find("p", class_="txt1").get_text("", strip=True)
        # Here the original text has double spaces. 
        # Example: FAKENHAM CC WON  BY 8 WICKETS 
        # (notice the double spaces between the words "WON" and "BY"). 
        result = div.find('table').find("div", class_="fonts-gt").get_text("", strip=True).replace("  ", " ") # <== Here 
      else: 
        # Get the "date" and clear the other variables: 
        temp_date = div.get_text("", strip=True)
        team_a = ""
        team_b = ""
        result = ""
    # Add elements to the list:
    if (len(team_a) > 0):
      result_array.append({"Team A" : team_a, "Team B": team_b, "Result": result, "Date": temp_date})

# Remove duplicates - source: https://stackoverflow.com/a/9427216/12511801
result_array = [i for n, i in enumerate(result_array) if i not in result_array[n   1:]]

# Build and display dataframe:
df_final = pd.DataFrame(result_array)
display(df_final)

Result:

index Team A Team B Result Date
0 Halstead CC, Essex-1st XI Fakenham CC-1st XI FAKENHAM CC WON BY 8 WICKETS Saturday 17 September 2022
1 Battisford & District CC-Saturday 1st XI Halstead CC, Essex-2nd XI CANCELLED Saturday 10 September 2022
2 Maldon CC-4th XI Halstead CC, Essex-3rd XI CANCELLED Saturday 10 September 2022
3 Halstead CC, Essex-1st XI Worlington CC-1st XI HALSTEAD CC, ESSEX WON BY 23 RUNS Saturday 10 September 2022
4 Halstead CC, Essex-2nd XI Dunmow CC-2nd XI HALSTEAD CC, ESSEX WON BY 25 RUNS Saturday 03 September 2022
5 Mistley CC-1st XI Halstead CC, Essex-1st XI HALSTEAD CC, ESSEX WON BY 4 WICKETS Saturday 03 September 2022
6 Halstead CC, Essex-NECL T20 Coggeshall Town CC-NECL T20 HALSTEAD CC, ESSEX WON BY 90 RUNS Sunday 28 August 2022
7 Halstead CC, Essex-NECL T20 Mistley CC-NECL T20 HALSTEAD CC, ESSEX WON BY 152 RUNS Sunday 28 August 2022
8 Halstead CC, Essex-2nd XI Stowupland CC-1st XI HALSTEAD CC, ESSEX WON BY 7 WICKETS Saturday 27 August 2022
9 Halstead CC, Essex-3rd XI Real Oddies CC-1st XI HALSTEAD CC, ESSEX WON BY 6 WICKETS Saturday 27 August 2022
10 Kesgrave CC-1st XI Halstead CC, Essex-1st XI HALSTEAD CC, ESSEX WON BY 7 WICKETS Saturday 27 August 2022
11 Sudbury CC, Suffolk-Sunday 1st XI Halstead CC, Essex-NECL 1st XI SUDBURY CC, SUFFOLK WON BY 22 RUNS Sunday 21 August 2022
12 St Margaret's CC, Suffolk-1st XI Halstead CC, Essex-2nd XI ST MARGARET'S CC, SUFFOLK WON BY 60 RUNS Saturday 20 August 2022
13 Kelvedon and Feering CC-2nd XI Halstead CC, Essex-3rd XI KELVEDON AND FEERING CC WON BY 1 WICKET Saturday 20 August 2022
14 Halstead CC, Essex-1st XI Woolpit CC-Saturday 1st XI HALSTEAD CC, ESSEX WON BY 6 WICKETS Saturday 20 August 2022
15 Halstead CC, Essex-2nd XI Mildenhall CC, Suffolk-4th XI HALSTEAD CC, ESSEX WON BY 8 WICKETS Saturday 13 August 2022
16 Halstead CC, Essex-3rd XI Witham CC-3rd XI HALSTEAD CC, ESSEX WON BY 9 WICKETS Saturday 13 August 2022
17 Wivenhoe Town CC-1st XI Halstead CC, Essex-1st XI HALSTEAD CC, ESSEX WON BY 3 WICKETS Saturday 13 August 2022
18 Halstead CC, Essex-NECL 1st XI Colchester and East Essex CC-NECL 1st XI HALSTEAD CC, ESSEX WON BY 5 WICKETS Sunday 07 August 2022
19 Ipswich CC-1st XI Halstead CC, Essex-2nd XI HALSTEAD CC, ESSEX WON BY 1 RUN Saturday 06 August 2022
20 Halstead CC, Essex-1st XI Clacton On Sea CC-Saturday XI HALSTEAD CC, ESSEX WON BY 9 WICKETS Saturday 06 August 2022
21 Halstead CC, Essex-NECL 1st XI Wickham St Pauls CC-1st XI WICKHAM ST PAULS CC WON BY 2 WICKETS Sunday 31 July 2022
22 Halstead CC, Essex-2nd XI West Bergholt CC-Two Counties 1st XI HALSTEAD CC, ESSEX WON BY 3 RUNS Saturday 30 July 2022
23 Halstead CC, Essex-3rd XI Abberton & District CC-3rd XI ABBERTON & DISTRICT CC WON BY 6 WICKETS Saturday 30 July 2022
24 Coggeshall Town CC-1st XI Halstead CC, Essex-1st XI HALSTEAD CC, ESSEX WON BY 117 RUNS Saturday 30 July 2022
  • Related