With the help of answers to the question "Python: Get html table data by xpath", I am trying to scrape the "Shareholding Pattern" information from a webpage. Here is the code:
import lxml.html as LH
import pprint
import requests

def screenerdata(symbol):
    with requests.Session() as sess:
        resp = sess.get('https://www.screener.in/company/' + symbol + '/consolidated/')
        root = LH.fromstring(resp.content)
        for tbody in root.xpath('/html/body/main/section[9]/div[2]/table/tbody'):
            data = [[tdata.text_content().replace(u'\xa0', u'').strip()
                     for tdata in trow.xpath('td')]
                    for trow in tbody.xpath('//tr')]
            pprint.pprint(data)

screenerdata("LTTS")
Since the HTML table on the webpage doesn't have an id or class, I copied the XPath using the Mozilla Firefox web developer tools. Everything works, except that the code also scrapes data from other tables. Any ideas on how to fix this? Thanks in advance.
CodePudding user response:
Do you have to use XPath? Since the data is in a <table> tag, why not let pandas parse the tables? pd.read_html returns a list of DataFrames (essentially one for each <table> tag in the HTML). The last table is the "Shareholding Pattern" one, so you can just take the last item of that list.
import pandas as pd

def screenerdata(symbol):
    url = 'https://www.screener.in/company/' + symbol + '/consolidated/'
    # read_html returns one DataFrame per <table>; the last one is the shareholding table
    df = pd.read_html(url)[-1]
    print(df.to_string())

screenerdata("LTTS")
Output:
   Unnamed: 0  Dec 2018  Mar 2019  Jun 2019  Sep 2019  Dec 2019  Mar 2020  Jun 2020  Sep 2020  Dec 2020  Mar 2021  Jun 2021  Sep 2021
0   Promoters     80.41     78.88     74.97     74.97     74.74     74.62     74.60     74.36     74.27     74.24     74.23     74.15
1        FIIs      4.22      5.09      8.50      8.93      8.26      8.37      8.95      7.97      8.87      9.06      8.92      9.50
2        DIIs      4.25      4.43      4.75      4.76      4.52      4.88      4.45      5.83      6.40      6.36      6.68      6.14
3      Public     11.12     11.60     11.78     11.34     12.48     12.13     12.00     11.83     10.46     10.34     10.17     10.21
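Taking the last table by index works only as long as the page layout stays the same. As a possible alternative, pd.read_html accepts a match argument that keeps only the tables containing a given string or regex. The sketch below runs on a small inline HTML snippet rather than the real page: the first table and the "Holder" and "Metric" column names are invented for illustration, and it assumes the real table can be identified by the text "Promoters".

```python
import io
import pandas as pd

# Toy stand-in for the real page: two tables, only the second is wanted.
# The first table and the header names are made up for this sketch.
HTML = """
<table>
  <tr><th>Metric</th><th>Mar 2021</th></tr>
  <tr><td>Sales</td><td>100</td></tr>
</table>
<table>
  <tr><th>Holder</th><th>Mar 2021</th></tr>
  <tr><td>Promoters</td><td>74.24</td></tr>
  <tr><td>FIIs</td><td>9.06</td></tr>
</table>
"""

# match= keeps only the tables whose text matches the given string/regex
tables = pd.read_html(io.StringIO(HTML), match="Promoters")
df = tables[0]
print(df.to_string())
```

Against the live page, the same call would take the URL in place of the StringIO object.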
CodePudding user response:
This line is the problem:
for trow in tbody.xpath('//tr') ]
An expression that starts with // is absolute: it "jumps up" to the root of the document and then searches the entire tree for every tr element, no matter which tbody you called it on.
You should make it a relative expression, .//tr instead of //tr. That looks for tr elements only below the current position (the selected tbody).
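The difference is easy to see on a small document. The following is a minimal sketch using an inline HTML string with two tables (the markup and cell values are made up for illustration, not taken from the real page); it runs both expressions from the second tbody:

```python
import lxml.html as LH

# Toy document with two tables; the contents are invented for this sketch.
HTML = """
<html><body>
  <table><tbody>
    <tr><td>a1</td><td>a2</td></tr>
  </tbody></table>
  <table><tbody>
    <tr><td>b1</td><td>b2</td></tr>
    <tr><td>b3</td><td>b4</td></tr>
  </tbody></table>
</body></html>
"""

root = LH.fromstring(HTML)
# Select only the second table's tbody, as the question does with its long path
tbody = root.xpath('/html/body/table[2]/tbody')[0]

def rows(context, expr):
    """Return the cell texts of every row that expr selects from context."""
    return [[td.text_content().strip() for td in tr.xpath('td')]
            for tr in context.xpath(expr)]

absolute_rows = rows(tbody, '//tr')    # searches from the document root
relative_rows = rows(tbody, './/tr')   # searches only below this tbody

print(absolute_rows)  # rows from both tables
print(relative_rows)  # rows from the second table only
```

In the original function, the only change needed is the same one: tbody.xpath('.//tr') in place of tbody.xpath('//tr').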