I have been working on this website. Everything works fine, but I am not able to get two more data items, pClose and Diff, into the table. Is there a reason they are not being printed? When I try to print the item at index 7, i.e. stock_data[7], I get a list index error. Any reason behind this?
Here is my code below:
import scrapy

from ..items import NepalLiveShareItem  # project-relative import; adjust the path to your project


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"
    # allowed_domains = ['nl.indeed.com']
    start_urls = ['https://merolagani.com/LatestMarket.aspx']

    def parse(self, response):
        for tr in response.xpath("//table[@class='table table-hover live-trading sortable']//tbody//tr"):
            stock_data = tr.css('td ::text').extract()
            items = NepalLiveShareItem()  # create a fresh item per row, not one shared instance
            items['symbol'] = stock_data[0]
            items['ltp'] = stock_data[1]
            items['percent_change'] = stock_data[2]
            items['open'] = stock_data[3]
            items['high'] = stock_data[4]
            items['low'] = stock_data[5]
            items['qty'] = stock_data[6]
            yield items
CodePudding user response:
All of the data columns are static except PClose and Diff. If you disable JavaScript in the browser, you will notice that the PClose and Diff columns disappear. To capture those two columns you can use a browser-automation tool such as Selenium.
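You can confirm this without touching a browser by fetching the raw HTML, which is all Scrapy ever sees, and counting the text cells in one row. A minimal sketch using requests plus parsel (the selector library Scrapy itself uses); the User-Agent header is an assumption in case the site rejects library defaults:
import requests
from parsel import Selector

html = requests.get(
    'https://merolagani.com/LatestMarket.aspx',
    headers={'User-Agent': 'Mozilla/5.0'},  # assumed; some sites block default library agents
).text
row = Selector(text=html).xpath(
    "//table[@class='table table-hover live-trading sortable']//tbody//tr"
)[0]
# Expect 7 values: the PClose/Diff cells stay empty until JavaScript fills them,
# which is why stock_data[7] raises an IndexError in the spider.
print(len(row.css('td ::text').extract()))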
I use Selenium with pandas to pull the table data without any complexity. An example using Selenium with pandas:
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://merolagani.com/LatestMarket.aspx')
time.sleep(2)  # crude wait so the JavaScript-filled columns can populate

d = pd.read_html(driver.page_source)[0]  # first table on the page
df = d.iloc[:, :9]                       # Symbol through Diff.
driver.quit()
print(df)
Output:
Symbol LTP % Change Open High Low Qty. PClose Diff.
0 ACLBSL 825.1 -1.66 822.3 826.0 822.3 292 839.0 -13.9
1 ADBL 331.0 1.88 325.0 333.0 323.0 23305 324.9 6.1
2 ADBLD83 1066.0 9.87 989.6 1066.0 989.6 170 970.2 95.8
3 AHPC 356.1 -2.38 370.0 370.0 351.5 54583 364.8 -8.7
4 AIL 448.0 -1.54 446.0 448.0 446.0 424 455.0 -7.0
.. ... ... ... ... ... ... ... ... ...
457 UNL 18360.0 2.00 18360.0 18360.0 18360.0 10 18000.0 360.0
458 UPCL 227.8 -0.96 231.5 232.0 227.2 31918 230.0 -2.2
459 UPPER 542.0 0.37 550.0 580.0 542.0 355232 540.0 2.0
460 USLB 1000.0 0.10 1000.0 1000.0 1000.0 30 999.0 1.0
461 VLBS 940.0 -0.32 943.0 954.9 925.1 1419 943.0 -3.0
[462 rows x 9 columns]
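As an aside, if you only need the numbers, the two JavaScript columns can be reconstructed from the static ones, assuming the site computes % Change as Diff / PClose * 100 (the rows above are consistent with this): PClose = LTP / (1 + % Change / 100) and Diff = LTP - PClose. A sketch against the dataframe above; expect occasional last-digit drift because % Change is rounded to two decimals:
# Column labels are taken from the printed output above.
df['PClose_calc'] = (df['LTP'] / (1 + df['% Change'] / 100)).round(1)
df['Diff_calc'] = (df['LTP'] - df['PClose_calc']).round(1)
print(df[['Symbol', 'PClose', 'PClose_calc', 'Diff.', 'Diff_calc']].head())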
CodePudding user response:
Import Selenium correctly and use chromedriver, even in a Jupyter notebook:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
While the specific <td>s which receive live data are fiddly to locate, we can inspect the page and find another element that appears once the live data has been received:
browser.get("https://merolagani.com/LatestMarket.aspx")
live_data = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="index-slider"]')))
dfs = pd.read_html(browser.page_source)
browser.quit()  ## very important, otherwise you end up using all memory
print(dfs[0].iloc[:, :9])
This will return the table in question, live values included, as a dataframe. It also waits only until the live data has loaded into the page, rather than sleeping for a fixed time.
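If you want the rows back in the asker's original item pipeline, one way is to rename the dataframe columns to the spider's field names and yield one record per row. A sketch: the lower-cased names come from the question's spider, and it assumes NepalLiveShareItem also declares pclose and diff fields:
df = dfs[0].iloc[:, :9]
df.columns = ['symbol', 'ltp', 'percent_change', 'open', 'high', 'low', 'qty', 'pclose', 'diff']
for record in df.to_dict('records'):
    print(record)
    # inside a spider: yield NepalLiveShareItem(**record)
    # (works only if the item declares all nine fields -- an assumption here)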