Scraping URL links in a table


I have been able to scrape the other data without issues, and I can also scrape the URL links using the code below.

import requests
from bs4 import BeautifulSoup

response2 = requests.get(url2)
soup = BeautifulSoup(response2.text, 'lxml')

for link in soup.find_all('a', href=True):
    print(link['href'])

However, I now face two challenges:

1- I am only interested in the highlighted URL on each row (the event link).

2- How do I use these links to scrape the data from each page in turn (as if I had set up separate code for each link, replacing the URL in urlfix)?

import pandas as pd
import requests

urlfix = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
responsefix = requests.get(urlfix)
dffix = pd.read_html(responsefix.text)[0]

# remove times and other data
dffix.drop(['Time', 'Time.1', 'Competitors'], axis=1, inplace=True)

# rename columns
dffix.rename(columns={dffix.columns[3]: 'Win_M', dffix.columns[4]: 'Win_F'}, inplace=True)

# filter for event
dffix['Worldchamps'] = dffix['Event'].str.contains(r'World Championships', na=True)
dffix['Worldcup'] = dffix['Event'].str.contains(r'World Cup', na=True)
# ~ negates the match (does not contain); | separates the two alternatives
dffix['Miscrace'] = ~dffix['Event'].str.contains(r'World Championships|World Cup', na=True)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(dffix)

Screenshot of the webpage

CodePudding user response:

To get only the event links, use the CSS selector .future td:nth-child(2) a:

for link in soup.select('.future  td:nth-child(2) a'):
    print(link['href'], link.text)
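
To see how that selector behaves without hitting the site, here is a minimal sketch run against a hand-written HTML fragment. The markup (rows with class future and the event link in the second cell), the sample hrefs, and the helper name event_links are assumptions for illustration, not copied from the real page. html.parser is used so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed for illustration only.
SAMPLE_HTML = """
<table id="T1">
  <tbody>
    <tr class="future">
      <td><a href="/organiser21/uci/">UCI</a></td>
      <td><a href="/event1/sample-event-one/">Sample Event One</a></td>
      <td>1 Jan 2022</td>
    </tr>
    <tr class="future">
      <td><a href="/organiser21/uci/">UCI</a></td>
      <td><a href="/event2/sample-event-two/">Sample Event Two</a></td>
      <td>2 Jan 2022</td>
    </tr>
  </tbody>
</table>
"""


def event_links(html):
    """Return (href, text) for the link in the second cell of each .future row."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a["href"], a.get_text(strip=True))
        for a in soup.select(".future td:nth-child(2) a")
    ]


links = event_links(SAMPLE_HTML)
for href, text in links:
    print(href, text)
```

Only the second-column links are returned; the organiser links in the first cell are skipped because td:nth-child(2) matches the second cell of each row.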

CodePudding user response:

Note: For future questions, please keep to one issue per question so it stays focused. Any further issue is better asked as a new question.

Just to point you in a direction: select your elements more specifically, and be aware that you have to concatenate the href with a base URL.

The following list comprehension creates a list of URLs that you can iterate over to fetch the detail tables. It uses CSS selectors to select each row in the tbody of the table with id T1 and concatenates the href of the first <a> in each row with the base URL:

['https://www.rootsandrain.com' + row.a['href'] for row in soup.select('#T1 tbody tr')]
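
As an alternative to plain string concatenation, the standard library's urljoin can build the absolute URLs. This is only a sketch: the helper name absolute_urls and the sample href list are assumptions for illustration, with the paths taken from the output shown further down.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.rootsandrain.com"


def absolute_urls(hrefs, base=BASE_URL):
    """Resolve each href against the base URL; absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


# Sample hrefs (assumed): one relative, one already absolute.
hrefs = [
    "/event9599/2022-jul-9-mercedes-benz-uci-world-cup-dh-4-lenzerheide/",
    "https://www.rootsandrain.com/event9598/2022-jun-11-mercedes-benz-uci-world-cup-dh-3-leogang/",
]

urls = absolute_urls(hrefs)
for u in urls:
    print(u)
```

The benefit over concatenation is that an href that is already absolute is not prefixed a second time.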

Keep in mind that there is also paging, and that some detail pages have no results. If you get stuck there, ask a new question and please also provide the expected output. Thanks.


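import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

urlList = ['https://www.rootsandrain.com' + row.a['href'] for row in soup.select('#T1 tbody tr')]

data = []

for url in urlList:
    try:
        # assumed reconstruction of the loop body: read the first table on each detail page
        data.append(pd.read_html(url)[0])
    except ValueError:
        print(f'No tables found:{url}')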



No tables found:https://www.rootsandrain.com/event9599/2022-jul-9-mercedes-benz-uci-world-cup-dh-4-lenzerheide/
No tables found:https://www.rootsandrain.com/event9598/2022-jun-11-mercedes-benz-uci-world-cup-dh-3-leogang/
No tables found:https://www.rootsandrain.com/event9597/2022-may-22-mercedes-benz-uci-world-cup-dh-2-fort-william/
Unnamed: 0 Pos⇧ Bib Name Unnamed: 4 Licence YoB Sponsors km/h sector1 sector2 sector3 sector4 sector5 = Qualifier km/h.1 sector1 .1 sector2 .1 sector3 .1 sector4 .1 sector5 =.1 Run 1 Diff sector3 = sector3 =.1
nan 1st 3 Loïc BRUNI nan 1.00075e+10 1994 Specialized Gravity 57.781 28.973s1 1:08.4101 40.922s1 31.328s6 24.900s11 3:14.5331 59.062 28.697s1 1:08.8755 40.703s1 31.067s16 24.037s3 3:13.3791 - nan nan
nan 2nd 7 Troy BROSNAN nan 1.00073e+10 1993 Canyon Collective Factory Team 56.258 29.331s8 1:09.1763 42.676s6 30.488s2 24.493s2 3:16.1643 59.023 29.008s5 1:09.40313 41.363s8 30.121s2 23.905s2 3:13.8002 0.421s nan nan
nan 3rd 16 Ángel SUÁREZ ALONSO nan 1.00088e+10 1995 COMMENCAL 21 54.1939 30.077s26 1:18.27071 1:16.68773 2:00.79772 26.728s67 5:32.55972 58.067 28.991s4 1:09.2669 41.973s16 29.531s1 24.249s7 3:14.0103 0.631s nan nan
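
Once a table has been read for each event page, the per-event tables can be stacked into a single DataFrame while recording which page each row came from. The sketch below uses small made-up frames and placeholder URLs in place of real pd.read_html results; the helper name combine and the source_url column are assumptions for illustration.

```python
import pandas as pd

# Stand-ins for tables returned by pd.read_html; URLs and values are made up.
tables = {
    "https://example.com/event-a/": pd.DataFrame({"Pos": ["1st", "2nd"], "Bib": [3, 7]}),
    "https://example.com/event-b/": pd.DataFrame({"Pos": ["1st"], "Bib": [16]}),
}


def combine(tables):
    """Stack the per-event tables and tag every row with its source URL."""
    frames = [df.assign(source_url=url) for url, df in tables.items()]
    return pd.concat(frames, ignore_index=True)


combined = combine(tables)
print(combined)
```

Keeping the source URL in a column makes it possible to filter or group the results per event later on.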