Scrapy All table data not scraped


I have been working on this website. Everything is fine, but I am not able to get two of the data items in the table: PClose and Diff. Why are they not printed? When I try to print the item at index 7, i.e. stock_data[7], I get a list index error. What is the reason behind this? Here is my code below:

import scrapy

from ..items import NepalLiveShareItem  # adjust to your project's items module


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"
    # allowed_domains = ['nl.indeed.com']

    start_urls = ['https://merolagani.com/LatestMarket.aspx']

    def parse(self, response):
        for tr in response.xpath("//table[@class='table table-hover live-trading sortable']//tbody//tr"):
            stock_data = tr.css('td ::text').extract()
            items = NepalLiveShareItem()  # fresh item per row
            items['symbol'] = stock_data[0]
            items['ltp'] = stock_data[1]
            items['percent_change'] = stock_data[2]
            items['open'] = stock_data[3]
            items['high'] = stock_data[4]
            items['low'] = stock_data[5]
            items['qty'] = stock_data[6]
            yield items

CodePudding user response:

  1. All data columns are static except PClose and Diff.

  2. If you disable JavaScript in the browser, you will notice that the PClose and Diff columns disappear: they are filled in by JavaScript after the page loads, so Scrapy's plain HTTP response never contains them (see the sketch after this list).

  3. To obtain those two columns you can use a browser-automation tool such as Selenium.

  4. I used Selenium with pandas to pull the table data without any complexity.
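
You can confirm point 2 without opening a browser: fetch the raw HTML and count the text cells in a row. A minimal sketch using requests and parsel (the selector library Scrapy uses internally); it assumes the server returns the static page to a plain HTTP client:

    import requests
    from parsel import Selector

    html = requests.get('https://merolagani.com/LatestMarket.aspx').text
    # Grab the first data row of the live-trading table from the static HTML
    row = Selector(text=html).xpath("//table[contains(@class, 'live-trading')]//tr[td]")[0]
    cells = row.css('td ::text').extract()
    # Expect 7: only the static columns yield text, so stock_data[7] raises IndexError
    print(len(cells))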

An example using Selenium with pandas:

    import time
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get('https://merolagani.com/LatestMarket.aspx')
    time.sleep(2)  # crude wait for the JavaScript-rendered columns
    df = pd.read_html(driver.page_source)[0].iloc[:, :9]  # first table, first 9 columns
    driver.quit()
    print(df)

Output:

       Symbol      LTP  % Change     Open     High      Low    Qty.   PClose  Diff.
0     ACLBSL    825.1     -1.66    822.3    826.0    822.3     292    839.0  -13.9
1       ADBL    331.0      1.88    325.0    333.0    323.0   23305    324.9    6.1
2    ADBLD83   1066.0      9.87    989.6   1066.0    989.6     170    970.2   95.8
3       AHPC    356.1     -2.38    370.0    370.0    351.5   54583    364.8   -8.7
4        AIL    448.0     -1.54    446.0    448.0    446.0     424    455.0   -7.0
..       ...      ...       ...      ...      ...      ...     ...      ...    ...
457      UNL  18360.0      2.00  18360.0  18360.0  18360.0      10  18000.0  360.0
458     UPCL    227.8     -0.96    231.5    232.0    227.2   31918    230.0   -2.2
459    UPPER    542.0      0.37    550.0    580.0    542.0  355232    540.0    2.0
460     USLB   1000.0      0.10   1000.0   1000.0   1000.0      30    999.0    1.0
461     VLBS    940.0     -0.32    943.0    954.9    925.1    1419    943.0   -3.0

[462 rows x 9 columns]
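
If you want the result back in the original Scrapy item rather than a dataframe, the rows map straight onto the item fields. A hedged sketch: the pclose and diff field names are hypothetical, since the NepalLiveShareItem definition is not shown in the question, and the column names are taken from the printed header above.

    from your_project.items import NepalLiveShareItem  # hypothetical import path

    def rows_to_items(df):
        # Yield one Scrapy item per dataframe row
        for _, row in df.iterrows():
            item = NepalLiveShareItem()
            item['symbol'] = row['Symbol']
            item['ltp'] = row['LTP']
            item['percent_change'] = row['% Change']
            item['open'] = row['Open']
            item['high'] = row['High']
            item['low'] = row['Low']
            item['qty'] = row['Qty.']
            item['pclose'] = row['PClose']  # hypothetical field name
            item['diff'] = row['Diff.']     # hypothetical field name
            yield item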

CodePudding user response:

Import Selenium correctly and use chromedriver; this works even in a Jupyter notebook:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

The specific <td>s that receive live data are fiddly to locate, but we can inspect the page and find another element that only appears once live data has been received:

browser.get("https://merolagani.com/LatestMarket.aspx")
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="index-slider"]')))  # wait for live data
dfs = pd.read_html(browser.page_source)
browser.quit() ## very important, otherwise you end up using all memory
print(dfs[0].iloc[:, :9])

This returns the table in question, live values included, as a dataframe. And because it waits for a specific element rather than sleeping for a fixed time, the script proceeds as soon as the live data has loaded.
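
One refinement: if the wait times out, the script raises before reaching browser.quit() and the Chrome process leaks. Wrapping the session in try/finally guarantees cleanup; a minimal sketch of the same steps, reusing the imports and options from above:

browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
try:
    browser.get("https://merolagani.com/LatestMarket.aspx")
    WebDriverWait(browser, 20).until(
        EC.element_to_be_clickable((By.XPATH, '//*[@id="index-slider"]'))
    )
    dfs = pd.read_html(browser.page_source)
finally:
    browser.quit()  # runs even on a timeout, so no stray Chrome processes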
