I am new to Python and looking for some help with web scraping. Ultimately, I want to scrape tennis players' current ranking data from coretennis.net. The URL I have been using to practice on is https://www.coretennis.net/tennis-player/liv-hovde/114585/profile.html
The code I currently have gives me more data than I need, and I am looking for a way to extract only what I need. The code is:
from bs4 import BeautifulSoup
import requests
import smtplib
import time
import datetime
URL = 'https://www.coretennis.net/tennis-player/liv-hovde/114585/profile.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"}
page = requests.get(URL, headers=headers)
soup1 = BeautifulSoup(page.content, "html.parser")
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
itf_rank = soup2.findAll(class_="rank")
print(itf_rank)
And the output I currently get is:
[<div >
450
<span >
5
</span>
</div>, <div >
3
<span >
0
</span>
</div>]
I only want to extract the ranks 450 and 3. In reality, most players won't have both ranks, so I will mainly have just one piece of ranking data (e.g. 3 from the example above).
Is anyone able to help?
Thanks in advance
Marc
I have tried passing different arguments to findAll, but nothing has worked. I was hoping to scrape only the player's rank number from the website.
CodePudding user response:
You're almost there: just loop through the divs, get each one's text, and split it to keep the first value.
For example:
import requests
from bs4 import BeautifulSoup
url = "https://www.coretennis.net/tennis-player/liv-hovde/114585/profile.html"
page = requests.get(url).text
wta, itf = [
    i.getText(strip=True, separator="|").split("|")[0]
    for i in BeautifulSoup(page, "lxml").select(".pphRankBox .rank")
]
print(f"WTA: {wta}, ITF: {itf}")
Output:
WTA: 450, ITF: 3
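Note that unpacking into exactly two names assumes the page shows two ranks. Since the question says most players have only one, here is a minimal sketch that collects however many ranks are present. The HTML string is a made-up stand-in modeled on the output in the question (the real page markup may differ), and `extract_ranks` is a hypothetical helper name, not part of any library. It uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page markup, modeled on the output shown in
# the question; the real page may be structured differently.
SAMPLE_HTML = """
<div class="pphRankBox"><div class="rank">450<span>5</span></div></div>
<div class="pphRankBox"><div class="rank">3<span>0</span></div></div>
"""


def extract_ranks(html):
    """Return the leading text of every .rank div as a list of strings."""
    soup = BeautifulSoup(html, "html.parser")
    ranks = []
    for div in soup.select(".pphRankBox .rank"):
        # stripped_strings yields the non-empty text fragments in document
        # order: first the rank itself, then the nested <span> value.
        first = next(div.stripped_strings, None)
        if first is not None:
            ranks.append(first)
    return ranks


print(extract_ranks(SAMPLE_HTML))
```

Because the result is a list, the same call works for a player with one rank, two ranks, or none (an empty list), so nothing breaks when a rank box is missing.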
CodePudding user response:
@baduker's answer is probably the more elegant way to go. If, however, you want to stick with your own logic, you can try the following code (only the parsing has changed, and the unnecessary imports have been removed).
Here, the text of each parent div is split on the newline characters introduced by prettify(), and the second piece holds the rank. Depending on whether you need the rank as a number or a string, you may want to add type casting to the result:
from bs4 import BeautifulSoup
import requests
URL = 'https://www.coretennis.net/tennis-player/liv-hovde/114585/profile.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"}
page = requests.get(URL, headers=headers)
soup1 = BeautifulSoup(page.content, "html.parser")
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
itf_rank = [item.text.split('\n')[1].strip() for item in soup2.select('.rank')]
print(itf_rank)
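If you need numbers rather than strings, the type casting mentioned above could look something like the sketch below. `to_int_ranks` is a hypothetical helper name, and it assumes that a valid rank is a plain run of decimal digits; anything else (for example an empty string or a placeholder, if the site uses one) is skipped rather than raising an error.

```python
def to_int_ranks(ranks):
    """Convert rank strings to ints, skipping entries that are not plain digits."""
    # str.isdecimal() is False for empty strings and for text such as "-",
    # so those entries never reach int().
    return [int(r) for r in ranks if r.strip().isdecimal()]


print(to_int_ranks(["450", "3"]))
```

You would call it on the list produced above, e.g. `to_int_ranks(itf_rank)`.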