import requests
from bs4 import BeautifulSoup
import pandas as pd
session = requests.Session()
session.verify = False
session.trust_env = False
url = 'https://basketball.realgm.com/nba/teams'
response = session.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
teams = soup.findAll('div',{'class':'small-column-left'})
for team in teams:
    name = team.get_text().strip()
    schedule_url = team.get('a[href]')
    print(name)
I get the result as:
Atlanta Hawks
Roster | Schedule | Stats
Charlotte Hornets
Roster | Schedule | Stats
Miami Heat
Roster | Schedule | Stats
Orlando Magic
Roster | Schedule | Stats
Washington Wizards
Roster | Schedule | Stats: https://basketball.realgm.com/nba/teams/Atlanta-Hawks/1/Home Northwest Division
Denver Nuggets
Roster | Schedule | Stats
Minnesota Timberwolves
Roster | Schedule | Stats
Oklahoma City Thunder
Roster | Schedule | Stats
Portland Trail Blazers
Roster | Schedule | Stats
Utah Jazz
Roster | Schedule | Stats: https://basketball.realgm.com/nba/teams/Denver-Nuggets/7/Home Pacific Division
But I want only the URL of each team's Schedule link, which sits behind the clickable text.
CodePudding user response:
In newer code, avoid the old findAll() syntax; use find_all() instead, or select() with CSS selectors. For more, take a minute to check the docs.
Select your elements more specifically, for example with CSS selectors:
[(t.parent.a.text, 'https://basketball.realgm.com' + t.get('href')) for t in soup.select('.basketball a[href*="Schedule"]')]
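To see why the selector works where team.get('a[href]') does not, here is a minimal offline sketch. The HTML snippet is hypothetical markup that only assumes one team entry on the page looks roughly like this, and urljoin is swapped in for plain string concatenation; neither comes from the original answer.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble a single team entry on the page.
html = """
<div class="basketball">
  <div class="small-column-left">
    <a href="/nba/teams/Atlanta-Hawks/1/Home">Atlanta Hawks</a><br>
    <a href="/nba/teams/Atlanta-Hawks/1/Rosters">Roster</a> |
    <a href="/nba/teams/Atlanta-Hawks/1/Schedule/2023">Schedule</a> |
    <a href="/nba/teams/Atlanta-Hawks/1/Stats">Stats</a>
  </div>
</div>
"""
base = 'https://basketball.realgm.com'
soup = BeautifulSoup(html, 'html.parser')

# Tag.get() looks up an attribute of the tag itself by name;
# it does not accept a CSS selector, so this yields None.
div = soup.find('div', class_='small-column-left')
missing = div.get('a[href]')

# a[href*="Schedule"] matches anchors whose href contains "Schedule".
# a.parent.a is the first link in the same container (the team name here).
links = [
    (a.parent.a.get_text(strip=True), urljoin(base, a['href']))
    for a in soup.select('.basketball a[href*="Schedule"]')
]

print(missing)
print(links)
```

If the real page nests the links differently, a.parent.a may not point at the team name, so it is worth checking one entry by hand first.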
Example
import requests
from bs4 import BeautifulSoup
session = requests.Session()
session.verify = False
session.trust_env = False
url = 'https://basketball.realgm.com/nba/teams'
response = session.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
[(t.parent.a.text, 'https://basketball.realgm.com' + t.get('href')) for t in soup.select('.basketball a[href*="Schedule"]')]
Output
['https://basketball.realgm.com/nba/teams/Boston-Celtics/2/Schedule/2023',
'https://basketball.realgm.com/nba/teams/Brooklyn-Nets/38/Schedule/2023',
'https://basketball.realgm.com/nba/teams/New-York-Knicks/20/Schedule/2023',
'https://basketball.realgm.com/nba/teams/Philadelphia-Sixers/22/Schedule/2023',
'https://basketball.realgm.com/nba/teams/Toronto-Raptors/28/Schedule/2023',
'https://basketball.realgm.com/nba/teams/Chicago-Bulls/4/Schedule/2023',
'https://basketball.realgm.com/nba/teams/Cleveland-Cavaliers/5/Schedule/2023',
'https://basketball.realgm.com/nba/teams/Detroit-Pistons/8/Schedule/2023',
'https://basketball.realgm.com/nba/teams/Indiana-Pacers/11/Schedule/2023',
'https://basketball.realgm.com/nba/teams/Milwaukee-Bucks/16/Schedule/2023',
'https://basketball.realgm.com/nba/teams/Atlanta-Hawks/1/Schedule/2023',...]
EDIT
Based on the additional comment, simply create a list of dicts and convert it to a DataFrame:
pd.DataFrame(
    [{'team': t.parent.a.text, 'url': 'https://basketball.realgm.com' + t.get('href')} for t in soup.select('.basketball a[href*="Schedule"]')]
)
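The list-of-dicts step can be tried without hitting the site. In this rough sketch the scraped pairs are hard-coded stand-ins: the hrefs are taken from the output above, while the team names are assumed from those URLs rather than read from the page.

```python
import pandas as pd

base = 'https://basketball.realgm.com'

# Stand-in (team, relative href) pairs; team names are assumed, not scraped.
scraped = [
    ('Boston Celtics', '/nba/teams/Boston-Celtics/2/Schedule/2023'),
    ('Brooklyn Nets', '/nba/teams/Brooklyn-Nets/38/Schedule/2023'),
]

# One dict per row; the dict keys become the DataFrame column names.
records = [{'team': team, 'url': base + href} for team, href in scraped]
df = pd.DataFrame(records)

print(df)
```

From there, df.to_csv('schedules.csv', index=False) would write the table to disk; the filename is only an example.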