How to get first class tag using beautiful soup

Time:10-23

I would like to scrape the fund price and date of the following url: https://www.blackrock.com/sg/en/products/241427/blackrock-cont-european-flexible-d4rf-gbp-fund and put these values in a table:

    Date       Price
21-Oct-2021    36.68

However, the HTML source contains several <span> elements that share the same class:

<span class="header-nav-label navAmount">
NAV as of 21-Oct-2021
</span>
<span class="header-nav-data">
GBP 36.68
</span>
<span class="header-nav-data">
0.10
(0.27%)
</span>

But I only want the first span with that class, the one that holds the daily price.

I've tried the following code:

from bs4 import BeautifulSoup
import requests

#Create url list
urls = ['https://www.blackrock.com/sg/en/products/241427/blackrock-cont-european-flexible-d4rf-gbp-fund']

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}

# Build the scraping loop
for url in urls:
    # Extract HTML element (daily price and date) from url 
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    spans = soup.findAll('span', {'class':'header-nav-data'})
    for span in spans:
       print (span.text)
    spans1 = soup.findAll('span', {'class':'header-nav-label navAmount'})
    print (spans1)

which returns:

GBP 36.68
0.10
(0.27%)
[<span class="header-nav-label navAmount">
NAV as of 21-Oct-2021
</span>]

What do I need to change to select only the first <span> with that class, since I'm only interested in the price? I'm new to Python, so I would greatly appreciate the help. Thanks!

CodePudding user response:

You could also pull out the JSON that is embedded in the HTML:

import requests
import re
import json
import pandas as pd

url = 'https://www.blackrock.com/sg/en/products/241427/blackrock-cont-european-flexible-d4rf-gbp-fund'
response = requests.get(url)

regex = r"(var navData = )(\[.*)(;)"
jsonStr = re.search(regex, response.text).groups()[1]
jsonStr = re.sub(r"((x:)(Date.UTC\(\d{4},\d{1,2},\d{1,2}\)),y:Number\({1,2})([\d.]*)([).\s\w\(]*)", r"\2\4", jsonStr)
jsonStr = jsonStr.replace('x:','"y":')
jsonStr = jsonStr.replace('formattedX:','"Date":')
jsonData = json.loads(jsonStr)

df = pd.DataFrame(jsonData)
df = df[['Date','y']]

Output (for only the most recent row, use print(df.tail(1)) instead):

print(df)
                  Date      y
0     Thu, 13 Sep 2012   9.81
1     Fri, 14 Sep 2012  10.07
2     Mon, 17 Sep 2012  10.02
3     Tue, 18 Sep 2012   9.94
4     Wed, 19 Sep 2012   9.96
               ...    ...
2275  Fri, 15 Oct 2021  36.30
2276  Mon, 18 Oct 2021  36.43
2277  Tue, 19 Oct 2021  36.48
2278  Wed, 20 Oct 2021  36.58
2279  Thu, 21 Oct 2021  36.68

[2280 rows x 2 columns]
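
For readers who only need the latest value and would rather avoid pandas, here is a minimal sketch of the same idea using just the standard library. The sample string is hypothetical: it is shaped the way the patterns in this answer expect and is not copied from the live page, and the substitution pattern is a stricter variant written for this sketch rather than the one used above.

```python
import json
import re

# Hypothetical one-line sample, shaped like the script variable the answer
# targets. It is an assumption, not text copied from the live page.
page_text = (
    'var navData = [{x:Date.UTC(2021,9,20),y:Number((36.58).toFixed(2)),'
    'formattedX: "Wed, 20 Oct 2021"},'
    '{x:Date.UTC(2021,9,21),y:Number((36.68).toFixed(2)),'
    'formattedX: "Thu, 21 Oct 2021"}];'
)

# Capture the JavaScript array literal assigned to navData.
match = re.search(r"var navData = (\[.*\]);", page_text)
if match is None:
    raise ValueError("navData not found; the page layout may have changed")
raw = match.group(1)

# Drop the Date.UTC(...) value and unwrap Number((v).toFixed(2)) to plain v,
# quoting the key so the result is valid JSON.
raw = re.sub(
    r"x:Date\.UTC\([\d,]+\),y:Number\(\((\d+(?:\.\d+)?)\)\.toFixed\(2\)\)",
    r'"y":\1',
    raw,
)
raw = raw.replace("formattedX:", '"Date":')

rows = json.loads(raw)
latest = rows[-1]  # the entries are assumed to be in chronological order
print(latest["Date"], latest["y"])
```

If the page changes how it embeds the data, the search returns no match and the sketch raises an error instead of silently producing an empty result.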

CodePudding user response:

You can pass limit=1 to findAll so that it stops after the first match (see the Beautiful Soup documentation):

from bs4 import BeautifulSoup
import requests

#Create url list
urls = ['https://www.blackrock.com/sg/en/products/241427/blackrock-cont-european-flexible-d4rf-gbp-fund']

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}

# Build the scraping loop
for url in urls:
    # Extract HTML element (daily price and date) from url 
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    spans = soup.findAll('span', {'class':'header-nav-data'})
    print(spans)
    print('----------------------------')
    spans = soup.findAll('span', {'class':'header-nav-data'}, limit=1)
    print(spans)
    print('---------------------')
    print(spans[0].text)
    # or
    for span in spans:
       print (span.text)
    spans1 = soup.findAll('span', {'class':'header-nav-label navAmount'})
    print (spans1)
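
A related option is find(), which returns the first matching tag directly instead of a one-element list. The sketch below runs offline against the HTML fragment quoted in the question, so it assumes the live page still has the same structure.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question, so the sketch needs no network access.
html = """
<span class="header-nav-label navAmount">
NAV as of 21-Oct-2021
</span>
<span class="header-nav-data">
GBP 36.68
</span>
<span class="header-nav-data">
0.10
(0.27%)
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
first = soup.find("span", class_="header-nav-data")

# find_all(..., limit=1) returns a list that holds at most one tag.
limited = soup.find_all("span", class_="header-nav-data", limit=1)

# strip=True removes the newlines that surround the text in the source.
price_text = first.get_text(strip=True)
print(price_text)
```

Because find() returns None when there is no match, real scraping code should check the result before calling get_text() on it.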

CodePudding user response:

Here is a working solution using CSS selectors.

Code:

from bs4 import BeautifulSoup
import requests

# Create url list
urls = ['https://www.blackrock.com/sg/en/products/241427/blackrock-cont-european-flexible-d4rf-gbp-fund']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}

# Build the scraping loop
for url in urls:
    # Extract HTML element (daily price and date) from url
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    spans1 = soup.select_one('ul.values-list li span:nth-child(1)').get_text(strip=True).replace('NAV as of', ' ')
    spans2 = soup.select_one('ul.values-list li span:nth-child(2)').get_text(strip=True).replace('GBP', ' ')
    
    print('Date:', spans1)
    print('Price:', spans2)

Output:

Date:  21-Oct-2021
Price:  36.68
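
The replace() calls above leave a stray space in front of each value. As a rough sketch, the two extracted strings can instead be split into the Date and Price columns the question asks for. The helper name parse_nav is hypothetical, and the input strings are the ones shown in the question rather than fetched from the page.

```python
# Text of the two spans as shown in the question (after stripping whitespace).
nav_label = "NAV as of 21-Oct-2021"
nav_value = "GBP 36.68"


def parse_nav(label, value):
    """Split the label and value text into (date, currency, price)."""
    # The date is the last whitespace-separated token of the label.
    date = label.rsplit(" ", 1)[-1]
    # The value is assumed to be "<currency> <amount>".
    currency, amount = value.split()
    return date, currency, float(amount)


date, currency, price = parse_nav(nav_label, nav_value)

# Print a small two-column table.
print(f"{'Date':<12} {'Price':>6}")
print(f"{date:<12} {price:>6.2f}")
```

Converting the price to a float makes it usable for arithmetic later, while the date is kept as a string here to avoid locale-dependent month parsing.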

CodePudding user response:

Try using this:

soup.find_all('span', class_='header-nav-label navAmount')
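
Note that class is a reserved word in Python, which is why Beautiful Soup spells the keyword argument class_. The sketch below, run against a one-line fragment based on the question's HTML, shows three ways to match a tag that carries two classes. Keep in mind that this span holds the date label, not the price.

```python
from bs4 import BeautifulSoup

# One-line fragment based on the span quoted in the question.
html = '<span class="header-nav-label navAmount">NAV as of 21-Oct-2021</span>'
soup = BeautifulSoup(html, "html.parser")

# Matches the exact string value of the class attribute (order matters).
exact = soup.find_all("span", class_="header-nav-label navAmount")

# Matches any tag that has this one class among its classes.
single = soup.find_all("span", class_="navAmount")

# CSS selector: requires both classes, in any order.
css = soup.select("span.header-nav-label.navAmount")

label = css[0].get_text(strip=True)
print(label)
```

The CSS selector form is usually the more robust of the three, since it keeps working if the site reorders the class names.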