Xpath scrapes data from all tables rather than the one I intend to

Time:12-02

With the help of the answers to the question "Python: Get html table data by xpath", I am trying to scrape the "Shareholding Pattern" information from a webpage. Here is the code:

import lxml.html as LH
import pprint
import requests

def screenerdata (symbol):
    with requests.Session() as sess:
        resp = sess.get('https://www.screener.in/company/' + symbol + '/consolidated/')
        root= LH.fromstring(resp.content)

        for tbody in root.xpath('/html/body/main/section[9]/div[2]/table/tbody'):
            data = [ [tdata.text_content().replace(u'\xa0', u'').strip()
                     for tdata in trow.xpath('td')]
                     for trow in tbody.xpath('//tr') ]
        pprint.pprint(data)

screenerdata("LTTS")

Since the HTML table on the webpage doesn't have an id or class, I copied the XPath using the Mozilla Firefox web developer tools. Everything works great, except that the code scrapes data from the other tables too. Any ideas about how to fix this problem? Thanks in advance.

CodePudding user response:

Do you have to access it through XPath? Since it's a <table> tag, why not let pandas parse the tables? It will return a list of DataFrames (essentially one for each <table> tag in the HTML). The last table is the "Shareholding Pattern", so you can just take the last item of that list.

import pandas as pd

def screenerdata (symbol):
    url = 'https://www.screener.in/company/' + symbol + '/consolidated/'
    df = pd.read_html(url)[-1]
    print(df.to_string())

screenerdata("LTTS")

Output:

   Unnamed: 0  Dec 2018  Mar 2019  Jun 2019  Sep 2019  Dec 2019  Mar 2020  Jun 2020  Sep 2020  Dec 2020  Mar 2021  Jun 2021  Sep 2021
0   Promoters     80.41     78.88     74.97     74.97     74.74     74.62     74.60     74.36     74.27     74.24     74.23     74.15
1        FIIs      4.22      5.09      8.50      8.93      8.26      8.37      8.95      7.97      8.87      9.06      8.92      9.50
2        DIIs      4.25      4.43      4.75      4.76      4.52      4.88      4.45      5.83      6.40      6.36      6.68      6.14
3      Public     11.12     11.60     11.78     11.34     12.48     12.13     12.00     11.83     10.46     10.34     10.17     10.21
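
Relying on the table's position in the list can break if the page layout changes. As a hedged alternative, pd.read_html accepts a match argument that keeps only tables containing the given text. The sketch below runs on a small made-up HTML string (the table contents and the "Holder" and "Metric" headers are assumptions for illustration, not the real page), and it assumes an HTML parser such as lxml is installed for pandas to use:

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the real page: two tables without id or class.
HTML = """
<table>
  <tr><th>Metric</th><th>Mar 2021</th></tr>
  <tr><td>Sales</td><td>100</td></tr>
</table>
<table>
  <tr><th>Holder</th><th>Sep 2021</th></tr>
  <tr><td>Promoters</td><td>74.15</td></tr>
  <tr><td>FIIs</td><td>9.50</td></tr>
</table>
"""

# Keep only the tables whose text matches "Promoters".
tables = pd.read_html(StringIO(HTML), match="Promoters")
shareholding = tables[0]
print(shareholding.to_string())
```

Whether "Promoters" appears in only one table on the real page is something you would need to check yourself.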

CodePudding user response:

This line is the problem:

for trow in tbody.xpath('//tr') ]

You are "jumping up" to the top of the XML tree and then looking down through the entire document for any and all tr elements.

You should make that a relative expression .//tr instead of //tr. That will look for any and all tr starting from the current position (the selected tbody).
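
To see the difference, here is a minimal sketch on a made-up document with two tables. The id attributes and cell values are hypothetical and only there to make the demo easy to follow (the real page has no ids):

```python
import lxml.html as LH

# Hypothetical two-table document for illustration only.
HTML = """
<html><body>
  <table id="first"><tbody>
    <tr><td>a1</td><td>a2</td></tr>
  </tbody></table>
  <table id="second"><tbody>
    <tr><td>b1</td><td>b2</td></tr>
    <tr><td>b3</td><td>b4</td></tr>
  </tbody></table>
</body></html>
"""

root = LH.fromstring(HTML)
tbody = root.xpath('//table[@id="second"]/tbody')[0]


def rows(node, expr):
    """Return the cell text of every row that expr selects from node."""
    return [[td.text_content().strip() for td in tr.xpath('td')]
            for tr in node.xpath(expr)]


# '//tr' starts at the document root, even though it is called on tbody.
absolute_rows = rows(tbody, '//tr')
# './/tr' only searches below the selected tbody.
relative_rows = rows(tbody, './/tr')

print(absolute_rows)  # rows from both tables
print(relative_rows)  # rows from the second table only
```

In the question's code, the same change means replacing tbody.xpath('//tr') with tbody.xpath('.//tr').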
