Python web scrape with same names div

I am new to Python and looking for some help with web scraping. I am ultimately looking to scrape tennis players' current ranking data from coretennis.net. The URL I have been using to practice on is https://www.coretennis.net/tennis-player/liv-hovde/114585/profile.html

The code I currently have gives me more data than I need, and I am looking for a way to extract only what I need. The code is:

from bs4 import BeautifulSoup
import requests
import smtplib
import time
import datetime

URL = 'https://www.coretennis.net/tennis-player/liv-hovde/114585/profile.html'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"}

page = requests.get(URL, headers=headers)

soup1 = BeautifulSoup(page.content, "html.parser")

soup2 = BeautifulSoup(soup1.prettify(), "html.parser")


itf_rank = soup2.findAll(class_="rank")
print(itf_rank)

And the output I currently get is:

[<div >
     450
     <span >
      5
     </span>
</div>, <div >
     3
     <span >
      0
     </span>
</div>]

I only want to extract the ranks 450 and 3. In reality, most players won't have both ranks, so I will mainly have just one piece of ranking data (e.g. 3 from the above example).

Is anyone able to help?

Thanks in advance

Marc

I have tried passing different arguments to findAll, but nothing has worked. I was hoping to scrape only the player rank number from the website.

CodePudding user response：

You're almost there: just loop through the divs, split each one's text, and take the first value.

For example:

import requests
from bs4 import BeautifulSoup

url = "https://www.coretennis.net/tennis-player/liv-hovde/114585/profile.html"
page = requests.get(url).text
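# Join each div's text fragments with "|" and keep the first one (the rank);
# the nested span's number ends up after the separator and is dropped.
# Unpacking into two names assumes the page shows exactly two rank boxes.
wta, itf = [
    i.getText(strip=True, separator="|").split("|")[0] for i
    in BeautifulSoup(page, "lxml").select(".pphRankBox .rank")
]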
print(f"WTA: {wta}, ITF: {itf}")

Output:

WTA: 450, ITF: 3
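
Since the question mentions that most players will have only one rank, unpacking into exactly two names would raise a ValueError on those pages. Below is a minimal sketch of one way to collect however many ranks are present. It runs against a hypothetical HTML snippet modelled on the output shown in the question (the real page markup may differ), and extract_ranks is an illustrative helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the output shown in the question;
# the real page may use different nesting or attributes.
SAMPLE_HTML = """
<div class="pphRankBox"><div class="rank">450<span>5</span></div></div>
<div class="pphRankBox"><div class="rank">3<span>0</span></div></div>
"""


def extract_ranks(html):
    """Return the leading number of every .rank div as a list of ints."""
    soup = BeautifulSoup(html, "html.parser")
    ranks = []
    for div in soup.select(".rank"):
        # stripped_strings yields the text fragments in document order,
        # so the first one is the rank and later ones come from the nested span.
        first = next(div.stripped_strings, None)
        if first is not None and first.isdigit():
            ranks.append(int(first))
    return ranks


print(extract_ranks(SAMPLE_HTML))
```

Returning a list means a page with one rank box gives a one-element list and a page with none gives an empty list, so the caller can decide how to label the values.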

CodePudding user response：

The answer by @baduker is probably the more elegant way to go. If, however, you want to stick with your own logic, you can try the following code (only the parsing is changed, and the unnecessary imports are removed).

Here, each parent div is read as text and split on the newline characters that appear in the prettified markup. Depending on whether you need the rank as a number or a string, you may want to cast the result:

from bs4 import BeautifulSoup
import requests

URL = 'https://www.coretennis.net/tennis-player/liv-hovde/114585/profile.html'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"}

page = requests.get(URL, headers=headers)

soup1 = BeautifulSoup(page.content, "html.parser")

soup2 = BeautifulSoup(soup1.prettify(), "html.parser")

itf_rank = [item.text.split('\n')[1].strip() for item in soup2.select('.rank')]
print(itf_rank)
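
As a possible variation on the type casting mentioned above, here is a rough sketch that reads each div's own text node directly and converts it to int, so it does not depend on where the newlines fall after prettify(). The HTML is a hypothetical sample with the same shape as the output in the question, and own_text is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the same shape as the output in the question.
SAMPLE_HTML = """
<div class="rank">
  450
  <span>5</span>
</div>
<div class="rank">
  3
  <span>0</span>
</div>
"""


def own_text(tag):
    """Return the tag's first direct text node, ignoring nested tags."""
    # recursive=False restricts the search to direct children of the tag.
    node = tag.find(string=True, recursive=False)
    return node.strip() if node else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Cast to int only when the text is purely digits, skipping anything else.
ranks = [int(text) for text in map(own_text, soup.select(".rank")) if text.isdigit()]
print(ranks)
```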