I have been able to scrape the other data without issues, and I can also scrape the URL links using the code below.
import requests
from bs4 import BeautifulSoup

response2 = requests.get(url2)
soup = BeautifulSoup(response2.text, 'lxml')
for link in soup.find_all('a', href=True):
    print(link['href'])
However, I now face two challenges:
1- I am only interested in the URL highlighted for each line (the event link).
2- How do I use these links to scrape the data from each page in turn (the same as if I set up new code for each of the links, replacing the URL in urlfix)?
import pandas as pd

urlfix = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
responsefix = requests.get(urlfix)
dffix = pd.read_html(responsefix.text)[0]
# remove times and other data
dffix.drop('Time', axis=1, inplace=True)
dffix.drop('Time.1', axis=1, inplace=True)
dffix.drop('Competitors', axis=1, inplace=True)
# rename columns
dffix.rename(columns={dffix.columns[3]: 'Win_M'}, inplace=True)
dffix.rename(columns={dffix.columns[4]: 'Win_F'}, inplace=True)
# filter for event
dffix['Worldchamps'] = dffix['Event'].str.contains(r'World Championships', na=True)
dffix['Worldcup'] = dffix['Event'].str.contains(r'World Cup', na=True)
# ~ negates the match ("does not contain"); | separates the two patterns
dffix['Miscrace'] = ~dffix['Event'].str.contains(r'World Championships|World Cup', na=True)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(dffix)
CodePudding user response:
To get the event link only, use the CSS selector .future td:nth-child(2) a:
for link in soup.select('.future td:nth-child(2) a'):
    print(link['href'], link.text)
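For a self-contained run, a minimal sketch along those lines could look as follows (it assumes the same listing page as in the question and the .future selector above; urljoin is used to turn the relative hrefs into absolute event URLs):
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# the second column of each event row holds the event link
event_urls = [urljoin(url, link['href'])
              for link in soup.select('.future td:nth-child(2) a')]
print(event_urls[:5])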
CodePudding user response:
Note: For future questions, please keep to one issue per question to stay focused - anything beyond that is better asked as a new question.
Just to point you in a direction: select your elements more specifically, and be aware that you have to concat the href with a base URL.
The following list comprehension will create a list of URLs you can use to iterate over and fetch the detail tables - CSS selectors are used to select each row in the tbody of the table with id T1, and the href of the first <a> in each row is concatenated with the base URL:
['https://www.rootsandrain.com' + row.a['href'] for row in soup.select('#T1 tbody tr')]
Keep in mind that there is also paging, and that there are detail pages without results, ... - If you get stuck there, ask a new question and please also provide the expected output. Thanks
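If the paging matters for your date range, one rough way to handle it is to keep collecting rows and following a "next" pager link until none is left. This is only a sketch under the assumption that such a link exists; the way it is located here (an anchor whose text contains "next") is a guess and may need adjusting to the real markup:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'https://www.rootsandrain.com'
page_url = base + '/organiser21/uci/events/filters/dh/'
event_urls = []

while page_url:
    soup = BeautifulSoup(requests.get(page_url).text, 'lxml')
    # same row selection as above: one event link per row of table #T1
    event_urls += [base + row.a['href'] for row in soup.select('#T1 tbody tr')]
    # look for a pager link labelled "next" (assumed markup - adjust if needed)
    next_link = soup.find('a', string=lambda s: s and 'next' in s.lower())
    page_url = urljoin(page_url, next_link['href']) if next_link else None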
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

urlList = ['https://www.rootsandrain.com' + row.a['href'] for row in soup.select('#T1 tbody tr')]

data = []
for url in urlList:
    try:
        data.append(pd.read_html(url)[0])
    except Exception:
        # e.g. event pages without a results table raise ValueError in read_html
        print(f'No tables found:{url}')

pd.concat(data)
Output
...
No tables found:https://www.rootsandrain.com/event9599/2022-jul-9-mercedes-benz-uci-world-cup-dh-4-lenzerheide/
No tables found:https://www.rootsandrain.com/event9598/2022-jun-11-mercedes-benz-uci-world-cup-dh-3-leogang/
No tables found:https://www.rootsandrain.com/event9597/2022-may-22-mercedes-benz-uci-world-cup-dh-2-fort-william/
...
Unnamed: 0 | Pos⇧ | Bib | Name | Unnamed: 4 | Licence | YoB | Sponsors | km/h | sector1 | sector2 | sector3 | sector4 | sector5 = | Qualifier | km/h.1 | sector1 .1 | sector2 .1 | sector3 .1 | sector4 .1 | sector5 =.1 | Run 1 | Diff | sector3 = | sector3 =.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
nan | 1st | 3 | Loïc BRUNI | nan | 1.00075e+10 | 1994 | Specialized Gravity | 57.781 | 28.973s1 | 1:08.4101 | 40.922s1 | 31.328s6 | 24.900s11 | 3:14.5331 | 59.062 | 28.697s1 | 1:08.8755 | 40.703s1 | 31.067s16 | 24.037s3 | 3:13.3791 | - | nan | nan |
nan | 2nd | 7 | Troy BROSNAN | nan | 1.00073e+10 | 1993 | Canyon Collective Factory Team | 56.258 | 29.331s8 | 1:09.1763 | 42.676s6 | 30.488s2 | 24.493s2 | 3:16.1643 | 59.023 | 29.008s5 | 1:09.40313 | 41.363s8 | 30.121s2 | 23.905s2 | 3:13.8002 | 0.421s | nan | nan |
nan | 3rd | 16 | Ángel SUÁREZ ALONSO | nan | 1.00088e+10 | 1995 | COMMENCAL 21 | 54.1939 | 30.077s26 | 1:18.27071 | 1:16.68773 | 2:00.79772 | 26.728s67 | 5:32.55972 | 58.067 | 28.991s4 | 1:09.2669 | 41.973s16 | 29.531s1 | 24.249s7 | 3:14.0103 | 0.631s | nan | nan |
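As a small, purely illustrative refinement of the example above: tagging each detail table with the event URL it came from before concatenating keeps every row attributable to its event (the event_url column name is just a made-up choice):
import pandas as pd

# urlList as built in the example above
frames = []
for event_url in urlList:
    try:
        df = pd.read_html(event_url)[0]
        df['event_url'] = event_url  # remember the source page of each row
        frames.append(df)
    except Exception:
        print(f'No tables found:{event_url}')

results = pd.concat(frames, ignore_index=True)
print(results.shape)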