How to extract data from a div tag when the div class name is dynamic, using Python?

I am scraping the Tickertape website to extract information about a product. The expected outcome after parsing the website is the set of Key Metrics values (for example AUM, Expense Ratio and Tracking Error).

The issue I am facing is that the div class names are very dynamic:

<div data-section-tag="key-metrics" class="jsx-382396230 ratios-card sp-card"><h2 class="jsx-382396230"><span class="jsx-382396230 content">Key Metrics</span></h2><div class="jsx-382396230 stats"><div class="jsx-1785027547 statbox ">
<div><div class="title font-medium text-dark text-14 pointer"><span class="jsx-559150734 key-ratio-title relative"><span class="jsx-559150734 ellipsis desktop--only">Realtime NAV</span><span class="jsx-559150734 ellipsis mob--only">Realtime NAV</span><div class="jsx-324047672 tooltip-root arrow-bottom arrow-left content-top content-left font-regular text-13 lh-138" style="color: rgb(255, 255, 255);"><h4 class="jsx-559150734 tooltip-head mb4 font-medium">Realtime NAV</h4><p class="jsx-559150734 lh-138">Value of each share's portion of the underlying assets and cash</p></div></span></div><div class="value text-15 ellipsis">₹ 181.73</div></div>
<div><div class="title font-medium text-dark text-14 pointer"><span class="jsx-559150734 key-ratio-title relative"><span class="jsx-559150734 ellipsis desktop--only">AUM</span><span class="jsx-559150734 ellipsis text-center mob--only">AUM</span><div class="jsx-324047672 tooltip-root arrow-bottom arrow-middle content-top content-middle font-regular text-13 lh-138" style="color: rgb(255, 255, 255);"><h4 class="jsx-559150734 tooltip-head mb4 font-medium">AUM</h4><p class="jsx-559150734 lh-138">The total market value of funds managed by the Asset Management Company</p></div></span></div><div class="value text-15 ellipsis">₹ 1,335.35cr</div></div>
<div><div class="title font-medium text-dark text-14 pointer"><span class="jsx-559150734 key-ratio-title relative"><span class="jsx-559150734 ellipsis desktop--only">Expense Ratio</span><span class="jsx-559150734 ellipsis text-right mob--only">Expense Ratio</span><div class="jsx-324047672 tooltip-root font-regular text-13 lh-138" style="color: rgb(255, 255, 255);"><h4 class="jsx-559150734 tooltip-head mb4 font-medium">Expense Ratio</h4><p class="jsx-559150734 lh-138">The operating and administrative costs of running the fund measured as the percentage of fund assets</p></div></span></div><div class="value text-15 ellipsis">0.12%</div></div>
<div><div class="title font-medium text-dark text-14 pointer"><span class="jsx-559150734 key-ratio-title relative"><span class="jsx-559150734 ellipsis desktop--only">Category Exp Ratio</span><span class="jsx-559150734 ellipsis mob--only">Cat. Expense Rat.</span><div class="jsx-324047672 tooltip-root arrow-bottom arrow-left content-top content-left font-regular text-13 lh-138" style="color: rgb(255, 255, 255);"><h4 class="jsx-559150734 tooltip-head mb4 font-medium">Category Exp Ratio</h4><p class="jsx-559150734 lh-138">Average of the operating and administrative costs of running ETFs of the same sector measured as the percentage of fund assets</p></div></span></div><div class="value text-15 ellipsis">0.22%</div></div>
<div><div class="title font-medium text-dark text-14 pointer"><span class="jsx-559150734 key-ratio-title relative"><span class="jsx-559150734 ellipsis desktop--only">Tracking Error</span><span class="jsx-559150734 ellipsis text-center mob--only">Tracking Error</span><div class="jsx-324047672 tooltip-root font-regular text-13 lh-138" style="color: rgb(255, 255, 255);"><h4 class="jsx-559150734 tooltip-head mb4 font-medium">Tracking Error</h4><p class="jsx-559150734 lh-138">The difference between the performance of the security and the benchmark index that it tracks</p></div></span></div><div class="value text-15 ellipsis">0.08%</div></div>
<div><div class="title font-medium text-dark text-14 pointer"><span class="jsx-559150734 key-ratio-title relative"><span class="jsx-559150734 ellipsis desktop--only">Category Tracking Err</span><span class="jsx-559150734 ellipsis text-right mob--only">Cat. Tracking Err.</span><div class="jsx-324047672 tooltip-root font-regular text-13 lh-138" style="color: rgb(255, 255, 255);"><h4 class="jsx-559150734 tooltip-head mb4 font-medium">Category Tracking Err</h4><p class="jsx-559150734 lh-138">Average of the difference between the performance of the ETF's peers and the benchmark index that it tracks</p></div></span></div><div class="value text-15 ellipsis">0.27%</div></div>
</div></div></div>

Code I developed to extract the information:

import requests
from bs4 import BeautifulSoup as bs

s = requests.Session()
response = s.get('https://www.tickertape.in/etfs/kotak-nifty-50-etf-KOTK')
soup = bs(response.text, 'html.parser')
res = soup.find("div", {"data-section-tag": "key-metrics"}).get_text()

# To get the AUM value
# 7 is added to the AUM position because the label 'AUM' is repeated and the ₹ symbol should be skipped
print("The AUM value", res[res.find('AUM') + ((len('AUM') * 2) + 1):res.find('Expense Ratio')])

# To get the expense ratio
print("The Expense ratio", res[res.find('Expense Ratio') + (len('Expense Ratio') * 2):res.find('Sector Expense')])

# To get the tracking error
print("The Tracking Error", res[res.find('Tracking Error') + (len('Tracking Error') * 2):res.find('Sector Tracking Error')])

# Close the connection
s.close()

Currently I am extracting the text and slicing it based on the length of the labels.

Is there a better way to extract the information?

CodePudding user response：

I would extract the JavaScript object that houses all the page data (it sits inside a script tag), parse it with the json package, and then pull out the values you need:

import re, json, requests

response = requests.get('https://www.tickertape.in/etfs/kotak-nifty-50-etf-KOTK')
# Grab the JSON object that starts with {"props" and runs to the last closing brace on that line
data = json.loads(re.search(r'(\{"props".*\})', response.text).group(1))
ratios = data['props']['pageProps']['securityInfo']['ratios']
print("The AUM value", '{:.2f}'.format(ratios['asstUnderMan']))
# The % format multiplies by 100, so divide by 100 first to print the original figure
print("The Expense ratio", '{:.2%}'.format(ratios['expenseRatio']/100))
print("The Tracking Error", '{:.2%}'.format(ratios['trackErr']/100))
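To see what the regex and json.loads steps do without hitting the network, here is a minimal offline sketch. The page fragment and its numbers are made up for illustration; only the key names (props, pageProps, securityInfo, ratios, asstUnderMan, expenseRatio, trackErr) come from the code above. extract_ratios is a hypothetical helper name, and the sketch assumes the JSON object sits on a single line inside a script tag.

```python
import json
import re

# Illustrative payload: the values are invented, the key names follow the answer above.
payload = {"props": {"pageProps": {"securityInfo": {"ratios": {
    "asstUnderMan": 1335.35, "expenseRatio": 0.12, "trackErr": 0.08}}}}}

# Stand-in for response.text: the JSON object embedded in a script tag.
SAMPLE_PAGE = (
    '<html><body><script type="application/json">'
    + json.dumps(payload)
    + '</script></body></html>'
)


def extract_ratios(page_text):
    """Find the embedded {"props": ...} object and return its ratios dict."""
    match = re.search(r'\{"props".*\}', page_text)
    if match is None:
        # Fail loudly instead of calling .group() on None when the page layout changes.
        raise ValueError('no {"props" ...} object found in the page')
    data = json.loads(match.group(0))
    return data["props"]["pageProps"]["securityInfo"]["ratios"]


ratios = extract_ratios(SAMPLE_PAGE)
print("The AUM value", "{:.2f}".format(ratios["asstUnderMan"]))
print("The Expense ratio", "{:.2%}".format(ratios["expenseRatio"] / 100))
print("The Tracking Error", "{:.2%}".format(ratios["trackErr"] / 100))
```

The explicit None check is the only change from the one-liner: if the site stops embedding the object, you get a clear error rather than an AttributeError.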

CodePudding user response：

I'm getting the desired output. I use Scrapy only so that I can apply XPath, because XPath makes it easy to grab the data.

Code:

import scrapy

class Ticker(scrapy.Spider):
    name = 'ticker'
    start_urls = ["https://www.tickertape.in/etfs/kotak-nifty-50-etf-KOTK"]

    def parse(self, response):
        yield {
            'Realtime NAV':  response.xpath('(//div[@class="value   text-15 ellipsis"])[1]/text()').get(),
            'AUM':  response.xpath('(//div[@class="value   text-15 ellipsis"])[2]/text()').get(),
            'Expense Ratio':  response.xpath('(//div[@class="value   text-15 ellipsis"])[3]/text()').get(),
            'Sctr Expense Ratio':  response.xpath('(//div[@class="value   text-15 ellipsis"])[4]/text()').get(),
            'Tracking Error':  response.xpath('(//div[@class="value   text-15 ellipsis"])[5]/text()').get(),
            'Sctr Tracking Error':  response.xpath('(//div[@class="value   text-15 ellipsis"])[6]/text()').get()
            }

Output in scrapy:

{'Realtime NAV': '₹ 181.56', 'AUM': '₹ 1,463.42cr', 'Expense Ratio': '0.12%', 'Sctr Expense Ratio': '0.22%', 'Tracking Error': '0.08%', 'Sctr Tracking Error': '0.26%'}
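If you need numbers rather than display strings (the question also wanted to drop the ₹ symbol), the scraped values can be post-processed. This is a minimal sketch: split_metric is a hypothetical helper name, and the regex assumes values shaped like the ones in the output above (an optional currency symbol, a number with optional thousands separators, and an optional suffix such as cr or %).

```python
import re


def split_metric(text):
    """Split a display string such as '₹ 1,463.42cr' into (number, suffix)."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*(\S*)", text)
    if match is None:
        # No digits found: return the text untouched so nothing is silently lost.
        return None, text.strip()
    number = float(match.group(1).replace(",", ""))
    return number, match.group(2)


# Values copied from the Scrapy output above.
print(split_metric("₹ 1,463.42cr"))
print(split_metric("0.12%"))
print(split_metric("₹ 181.56"))
```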

Output in csv:

Realtime NAV  AUM           Expense Ratio  Sctr Expense Ratio  Tracking Error  Sctr Tracking Error
₹ 181.56      ₹ 1,463.42cr  0.12%          0.22%               0.08%           0.26%
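For comparison, a BeautifulSoup-only variant that avoids both string slicing and positional indexes might look like the sketch below. It ignores the changing jsx-... classes and relies only on the names that look stable in the markup from the question: the data-section-tag attribute and the title, value and desktop--only classes. That stability is an assumption, extract_key_metrics is a hypothetical helper name, and the sample is a trimmed copy of the question's HTML (two of the six metrics), not a live response.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup posted in the question (tooltips omitted).
SAMPLE_HTML = """
<div data-section-tag="key-metrics" class="jsx-382396230 ratios-card sp-card">
<div class="jsx-382396230 stats"><div class="jsx-1785027547 statbox ">
<div><div class="title font-medium text-dark text-14 pointer">
<span class="jsx-559150734 key-ratio-title relative">
<span class="jsx-559150734 ellipsis desktop--only">AUM</span>
<span class="jsx-559150734 ellipsis text-center mob--only">AUM</span>
</span></div><div class="value text-15 ellipsis">₹ 1,335.35cr</div></div>
<div><div class="title font-medium text-dark text-14 pointer">
<span class="jsx-559150734 key-ratio-title relative">
<span class="jsx-559150734 ellipsis desktop--only">Expense Ratio</span>
<span class="jsx-559150734 ellipsis text-right mob--only">Expense Ratio</span>
</span></div><div class="value text-15 ellipsis">0.12%</div></div>
</div></div></div>
"""


def extract_key_metrics(html):
    """Map each metric label to its displayed value inside the key-metrics section."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", {"data-section-tag": "key-metrics"})
    metrics = {}
    if section is None:
        return metrics
    # class_="value" matches any div whose class list contains "value".
    for value_div in section.find_all("div", class_="value"):
        box = value_div.parent  # wrapper div holding one title/value pair
        label = box.find("span", class_="desktop--only")
        if label is not None:
            metrics[label.get_text(strip=True)] = value_div.get_text(strip=True)
    return metrics


metrics = extract_key_metrics(SAMPLE_HTML)
print(metrics)
```

Because each value is paired with the label from its own wrapper div, the result does not depend on the order of the metrics or on the exact whitespace in the class attribute.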