I have been able to scrape the other data without issues, and I can also scrape the URL links using the code below.
import requests
from bs4 import BeautifulSoup

response2 = requests.get(url2)
soup = BeautifulSoup(response2.text, 'lxml')
for link in soup.find_all('a', href=True):
    print(link['href'])
However, I now face two challenges:
1- I am only interested in the URL highlighted for each line (the event link).
2- How do I use these links to scrape the data from each page in turn (the same as if I set up new code for each of the links, replacing the URL in urlfix)?
import pandas as pd

urlfix = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
responsefix = requests.get(urlfix)
dffix = pd.read_html(responsefix.text)[0]
# remove times and other data
dffix.drop('Time', axis=1, inplace=True)
dffix.drop('Time.1', axis=1, inplace=True)
dffix.drop('Competitors', axis=1, inplace=True)
# rename columns
dffix.rename(columns={dffix.columns[3]: 'Win_M'}, inplace=True)
dffix.rename(columns={dffix.columns[4]: 'Win_F'}, inplace=True)
# filter for event
dffix['Worldchamps'] = dffix['Event'].str.contains(r'World Championships', na=True)
dffix['Worldcup'] = dffix['Event'].str.contains(r'World Cup', na=True)
# ~ negates the match ("does not contain"); | separates the two patterns
dffix['Miscrace'] = ~dffix['Event'].str.contains(r'World Championships|World Cup', na=True)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(dffix)
CodePudding user response:
To get the event link only, use the CSS selector .future td:nth-child(2) a:
for link in soup.select('.future td:nth-child(2) a'):
    print(link['href'], link.text)
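For a self-contained run, a minimal sketch along those lines could look as follows (it assumes the same listing page as in the question and the .future selector above; urljoin is used to turn the relative hrefs into absolute event URLs):
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# the second column of each event row holds the event link
event_urls = [urljoin(url, link['href'])
              for link in soup.select('.future td:nth-child(2) a')]
print(event_urls[:5])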
CodePudding user response:
Note: For future questions, please keep to one issue per question to stay focused - anything beyond that is better asked as a new question.
Just to point you in a direction: select your elements more specifically, and be aware that you have to concat the href with a base URL.
The following list comprehension will create a list of URLs you can use to iterate over and fetch the detail tables - CSS selectors are used to select each row in the tbody of the table with id T1, and the href of the first <a> in each row is concatenated with the base URL:
['https://www.rootsandrain.com' + row.a['href'] for row in soup.select('#T1 tbody tr')]
Keep in mind that there is also paging, and that there are detail pages without results, ... - If you get stuck there, ask a new question and please also provide the expected output. Thanks
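If the paging matters for your date range, one rough way to handle it is to keep collecting rows and following a "next" pager link until none is left. This is only a sketch under the assumption that such a link exists; the way it is located here (an anchor whose text contains "next") is a guess and may need adjusting to the real markup:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'https://www.rootsandrain.com'
page_url = base + '/organiser21/uci/events/filters/dh/'
event_urls = []

while page_url:
    soup = BeautifulSoup(requests.get(page_url).text, 'lxml')
    # same row selection as above: one event link per row of table #T1
    event_urls += [base + row.a['href'] for row in soup.select('#T1 tbody tr')]
    # look for a pager link labelled "next" (assumed markup - adjust if needed)
    next_link = soup.find('a', string=lambda s: s and 'next' in s.lower())
    page_url = urljoin(page_url, next_link['href']) if next_link else None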
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

urlList = ['https://www.rootsandrain.com' + row.a['href'] for row in soup.select('#T1 tbody tr')]

data = []
for url in urlList:
    try:
        data.append(pd.read_html(url)[0])
    except Exception:
        # e.g. event pages without a results table raise ValueError in read_html
        print(f'No tables found:{url}')

pd.concat(data)
Output
...
No tables found:https://www.rootsandrain.com/event9599/2022-jul-9-mercedes-benz-uci-world-cup-dh-4-lenzerheide/
No tables found:https://www.rootsandrain.com/event9598/2022-jun-11-mercedes-benz-uci-world-cup-dh-3-leogang/
No tables found:https://www.rootsandrain.com/event9597/2022-may-22-mercedes-benz-uci-world-cup-dh-2-fort-william/
...
Unnamed: 0 | Pos⇧ | Bib | Name | Unnamed: 4 | Licence | YoB | Sponsors | km/h | sector1 | sector2 | sector3 | sector4 | sector5 = | Qualifier | km/h.1 | sector1 .1 | sector2 .1 | sector3 .1 | sector4 .1 | sector5 =.1 | Run 1 | Diff | sector3 = | sector3 =.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
nan | 1st | 3 | Loïc BRUNI | nan | 1.00075e+10 | 1994 | Specialized Gravity | 57.781 | 28.973s1 | 1:08.4101 | 40.922s1 | 31.328s6 | 24.900s11 | 3:14.5331 | 59.062 | 28.697s1 | 1:08.8755 | 40.703s1 | 31.067s16 | 24.037s3 | 3:13.3791 | - | nan | nan |
nan | 2nd | 7 | Troy BROSNAN | nan | 1.00073e+10 | 1993 | Canyon Collective Factory Team | 56.258 | 29.331s8 | 1:09.1763 | 42.676s6 | 30.488s2 | 24.493s2 | 3:16.1643 | 59.023 | 29.008s5 | 1:09.40313 | 41.363s8 | 30.121s2 | 23.905s2 | 3:13.8002 | 0.421s | nan | nan |
nan | 3rd | 16 | Ángel SUÁREZ ALONSO | nan | 1.00088e+10 | 1995 | COMMENCAL 21 | 54.1939 | 30.077s26 | 1:18.27071 | 1:16.68773 | 2:00.79772 | 26.728s67 | 5:32.55972 | 58.067 | 28.991s4 | 1:09.2669 | 41.973s16 | 29.531s1 | 24.249s7 | 3:14.0103 | 0.631s | nan | nan |
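As a small, purely illustrative refinement of the example above: tagging each detail table with the event URL it came from before concatenating keeps every row attributable to its event (the event_url column name is just a made-up choice):
import pandas as pd

# urlList as built in the example above
frames = []
for event_url in urlList:
    try:
        df = pd.read_html(event_url)[0]
        df['event_url'] = event_url  # remember the source page of each row
        frames.append(df)
    except Exception:
        print(f'No tables found:{event_url}')

results = pd.concat(frames, ignore_index=True)
print(results.shape)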