How to extract data from a div tag when the div class name is dynamic, using Python?

Time:09-22

I am scraping the tickertape website to extract information about the product. Expected outcome after parsing the website.

The issue I am facing is that the div class names are very dynamic:

<div data-section-tag="key-metrics" class="jsx-382396230 ratios-card sp-card">
  <h2 class="jsx-382396230"><span class="jsx-382396230 content">Key Metrics</span></h2>
  <div class="jsx-382396230 stats">
    <div class="jsx-1785027547 statbox ">
      <div>
        <div class="title font-medium text-dark text-14 pointer">
          <span class="jsx-559150734 key-ratio-title relative">
            <span class="jsx-559150734 ellipsis desktop--only">Realtime NAV</span>
            <span class="jsx-559150734 ellipsis mob--only">Realtime NAV</span>
            <div class="jsx-324047672 tooltip-root arrow-bottom arrow-left content-top content-left font-regular text-13 lh-138" style="color: rgb(255, 255, 255);">
              <h4 class="jsx-559150734 tooltip-head mb4 font-medium">Realtime NAV</h4>
              <p class="jsx-559150734 lh-138">Value of each share's portion of the underlying assets and cash</p>
            </div>
          </span>
        </div>
        <div class="value text-15 ellipsis">₹ 181.73</div>
      </div>
      <div>
        <div class="title font-medium text-dark text-14 pointer">
          <span class="jsx-559150734 key-ratio-title relative">
            <span class="jsx-559150734 ellipsis desktop--only">AUM</span>
            <span class="jsx-559150734 ellipsis text-center mob--only">AUM</span>
            <div class="jsx-324047672 tooltip-root arrow-bottom arrow-middle content-top content-middle font-regular text-13 lh-138" style="color: rgb(255, 255, 255);">
              <h4 class="jsx-559150734 tooltip-head mb4 font-medium">AUM</h4>
              <p class="jsx-559150734 lh-138">The total market value of funds managed by the Asset Management Company</p>
            </div>
          </span>
        </div>
        <div class="value text-15 ellipsis">₹ 1,335.35cr</div>
      </div>
      <div>
        <div class="title font-medium text-dark text-14 pointer">
          <span class="jsx-559150734 key-ratio-title relative">
            <span class="jsx-559150734 ellipsis desktop--only">Expense Ratio</span>
            <span class="jsx-559150734 ellipsis text-right mob--only">Expense Ratio</span>
            <div class="jsx-324047672 tooltip-root font-regular text-13 lh-138" style="color: rgb(255, 255, 255);">
              <h4 class="jsx-559150734 tooltip-head mb4 font-medium">Expense Ratio</h4>
              <p class="jsx-559150734 lh-138">The operating and administrative costs of running the fund measured as the percentage of fund assets</p>
            </div>
          </span>
        </div>
        <div class="value text-15 ellipsis">0.12%</div>
      </div>
      <div>
        <div class="title font-medium text-dark text-14 pointer">
          <span class="jsx-559150734 key-ratio-title relative">
            <span class="jsx-559150734 ellipsis desktop--only">Category Exp Ratio</span>
            <span class="jsx-559150734 ellipsis mob--only">Cat. Expense Rat.</span>
            <div class="jsx-324047672 tooltip-root arrow-bottom arrow-left content-top content-left font-regular text-13 lh-138" style="color: rgb(255, 255, 255);">
              <h4 class="jsx-559150734 tooltip-head mb4 font-medium">Category Exp Ratio</h4>
              <p class="jsx-559150734 lh-138">Average of the operating and administrative costs of running ETFs of the same sector measured as the percentage of fund assets</p>
            </div>
          </span>
        </div>
        <div class="value text-15 ellipsis">0.22%</div>
      </div>
      <div>
        <div class="title font-medium text-dark text-14 pointer">
          <span class="jsx-559150734 key-ratio-title relative">
            <span class="jsx-559150734 ellipsis desktop--only">Tracking Error</span>
            <span class="jsx-559150734 ellipsis text-center mob--only">Tracking Error</span>
            <div class="jsx-324047672 tooltip-root font-regular text-13 lh-138" style="color: rgb(255, 255, 255);">
              <h4 class="jsx-559150734 tooltip-head mb4 font-medium">Tracking Error</h4>
              <p class="jsx-559150734 lh-138">The difference between the performance of the security and the benchmark index that it tracks</p>
            </div>
          </span>
        </div>
        <div class="value text-15 ellipsis">0.08%</div>
      </div>
      <div>
        <div class="title font-medium text-dark text-14 pointer">
          <span class="jsx-559150734 key-ratio-title relative">
            <span class="jsx-559150734 ellipsis desktop--only">Category Tracking Err</span>
            <span class="jsx-559150734 ellipsis text-right mob--only">Cat. Tracking Err.</span>
            <div class="jsx-324047672 tooltip-root font-regular text-13 lh-138" style="color: rgb(255, 255, 255);">
              <h4 class="jsx-559150734 tooltip-head mb4 font-medium">Category Tracking Err</h4>
              <p class="jsx-559150734 lh-138">Average of the difference between the performance of the ETF's peers and the benchmark index that it tracks</p>
            </div>
          </span>
        </div>
        <div class="value text-15 ellipsis">0.27%</div>
      </div>
    </div>
  </div>
</div>

The code I developed to extract the information:

import requests
from bs4 import BeautifulSoup as bs

s = requests.Session()
response = s.get('https://www.tickertape.in/etfs/kotak-nifty-50-etf-KOTK')
soup = bs(response.text, 'html.parser')
res = soup.find("div", {"data-section-tag": "key-metrics"}).get_text()

# To get the AUM value
# The AUM offset is increased by 7 because the "AUM" label repeats and the ₹ symbol should be skipped
print("The AUM value", res[res.find('AUM') + ((len('AUM')*2) + 1):res.find('Expense Ratio')])

# To get the expense ratio
print("The Expense ratio", res[res.find('Expense Ratio') + (len('Expense Ratio')*2):res.find('Sector Expense')])

# To get the tracking error
print("The Tracking Error", res[res.find('Tracking Error') + (len('Tracking Error')*2):res.find('Sector Tracking Error')])

# Close the connection
s.close()

Currently I am extracting the text and slicing the string based on the lengths of the labels.

Is there a better way to extract the information?

CodePudding user response:

I would extract the JS object that holds all the page data from within a script tag, parse it with the json package, and then read your desired values:

import re, json, requests

response = requests.get('https://www.tickertape.in/etfs/kotak-nifty-50-etf-KOTK')
data = json.loads(re.search(r'(\{"props".*\})', response.text).group(1))
ratios = data['props']['pageProps']['securityInfo']['ratios']
print("The AUM value", '{:.2f}'.format(ratios['asstUnderMan'])) 
print("The Expense ratio", '{:.2%}'.format(ratios['expenseRatio']/100))
print("The Tracking Error", '{:.2%}'.format(ratios['trackErr']/100))
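
A slightly more defensive variant of the same idea, as a rough sketch only. It assumes the JSON sits in a script tag with id "__NEXT_DATA__" (the usual Next.js convention, which the "props"/"pageProps" keys hint at but which I have not verified for this site). The sample_html string, its values, and the extract_ratios helper are made up for illustration so the snippet runs without a network call:

```python
import json
import re

# Hypothetical page fragment; the real page is much larger and its values differ.
sample_html = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"securityInfo": {"ratios": '
    '{"asstUnderMan": 1335.35, "expenseRatio": 0.12, "trackErr": 0.08}}}}}'
    '</script></body></html>'
)


def extract_ratios(html):
    """Return the ratios dict from the embedded JSON, or None if the tag is absent."""
    # Assumption: the data lives in <script id="__NEXT_DATA__">...</script>.
    match = re.search(
        r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL
    )
    if match is None:
        return None
    data = json.loads(match.group(1))
    return data["props"]["pageProps"]["securityInfo"]["ratios"]


ratios = extract_ratios(sample_html)
print("The AUM value", "{:.2f}".format(ratios["asstUnderMan"]))
print("The Expense ratio", "{:.2%}".format(ratios["expenseRatio"] / 100))
print("The Tracking Error", "{:.2%}".format(ratios["trackErr"] / 100))
```

Anchoring the regex on the script tag rather than on the first '{"props"' keeps the match from running past the end of the JSON, and the None check gives you a clear signal if the page layout changes.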

CodePudding user response:

This gives me the desired output. I use Scrapy only so that I can apply XPath, which makes it easy to grab the data.

Code:

import scrapy

class Ticker(scrapy.Spider):
    name = 'ticker'
    start_urls = ["https://www.tickertape.in/etfs/kotak-nifty-50-etf-KOTK"]

    def parse(self, response):
        yield {
            'Realtime NAV':  response.xpath('(//div[@class="value   text-15 ellipsis"])[1]/text()').get(),
            'AUM':  response.xpath('(//div[@class="value   text-15 ellipsis"])[2]/text()').get(),
            'Expense Ratio':  response.xpath('(//div[@class="value   text-15 ellipsis"])[3]/text()').get(),
            'Sctr Expense Ratio':  response.xpath('(//div[@class="value   text-15 ellipsis"])[4]/text()').get(),
            'Tracking Error':  response.xpath('(//div[@class="value   text-15 ellipsis"])[5]/text()').get(),
            'Sctr Tracking Error':  response.xpath('(//div[@class="value   text-15 ellipsis"])[6]/text()').get()
            }

Output in scrapy:

{'Realtime NAV': '₹ 181.56', 'AUM': '₹ 1,463.42cr', 'Expense Ratio': '0.12%', 'Sctr Expense Ratio': '0.22%', 'Tracking Error': '0.08%', 'Sctr Tracking Error': '0.26%'}

Output in csv:

Realtime NAV  AUM         Expense Ratio  Sctr Expense Ratio  Tracking Error  Sctr Tracking Error
181.56        1,463.42cr  0.12%          0.22%               0.08%           0.26%
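
If you would rather not depend on the position of each value, one option is to pair every title with the value that follows it, matching only the class tokens that look stable in the question's markup ("desktop--only" and "value") and ignoring the changing jsx-... ones. Below is a minimal sketch using only the standard library's html.parser instead of Scrapy; the KeyMetricsParser name and the trimmed sample string are mine, and I am assuming those two class tokens stay stable on the live page:

```python
from html.parser import HTMLParser


class KeyMetricsParser(HTMLParser):
    """Collect {title: value} pairs using only the stable class tokens."""

    def __init__(self):
        super().__init__()
        self.metrics = {}
        self._capture = None  # "title" or "value" while inside a target element
        self._title = None    # last title seen, waiting for its value

    def handle_starttag(self, tag, attrs):
        # Split the class attribute into tokens so jsx-... hashes do not matter.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "desktop--only" in classes:
            self._capture = "title"
        elif tag == "div" and "value" in classes:
            self._capture = "value"
        else:
            self._capture = None

    def handle_data(self, data):
        if self._capture == "title":
            self._title = data.strip()
        elif self._capture == "value" and self._title is not None:
            self.metrics[self._title] = data.strip()
            self._title = None
        self._capture = None

    def handle_endtag(self, tag):
        # An empty target element must not capture text from a later element.
        self._capture = None


# Trimmed copy of the markup from the question (two metrics only).
sample = (
    '<div data-section-tag="key-metrics" class="jsx-382396230 ratios-card sp-card">'
    '<div><div class="title font-medium text-dark text-14 pointer">'
    '<span class="jsx-559150734 key-ratio-title relative">'
    '<span class="jsx-559150734 ellipsis desktop--only">AUM</span>'
    '<span class="jsx-559150734 ellipsis text-center mob--only">AUM</span>'
    '<h4 class="jsx-559150734 tooltip-head mb4 font-medium">AUM</h4>'
    '</span></div>'
    '<div class="value text-15 ellipsis">₹ 1,335.35cr</div></div>'
    '<div><div class="title font-medium text-dark text-14 pointer">'
    '<span class="jsx-559150734 ellipsis desktop--only">Expense Ratio</span>'
    '</div><div class="value text-15 ellipsis">0.12%</div></div>'
    '</div>'
)

parser = KeyMetricsParser()
parser.feed(sample)
parser.close()
print(parser.metrics)
```

Because each value is keyed by its title, a reordered or newly added metric would not shift the others the way the [1]..[6] indexes can.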

    