I want to scrape data from https://www.nasdaqtrader.com/trader.aspx?id=TradeHalts. I tried different approaches, like this, this, and this.
I can scrape static pages, but I still don't understand the ASPX format very well. Here is the code I adapted from the first reference link:
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://www.indiapost.gov.in',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://www.nitt.edu/prm/nitreg/ShowRes.aspx',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}
class MyOpener(urllib.request.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
url = 'https://www.nasdaqtrader.com/Trader.aspx?id=TradeHalts'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)
# parse and retrieve two vital form values
viewstate = soup.findAll("input", {"type": "hidden", "name": "__VIEWSTATE"})
eventvalidation = soup.findAll("input", {"type": "hidden", "name": "__EVENTVALIDATION"})
formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEENCRYPTED', ''),
)
encodedFields = urllib.parse.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
# We use BeautifulSoup
soup = BeautifulSoup(f)
# soup.content looks up a <content> tag and returns None; print the parsed HTML instead
print(soup.prettify())
I cannot find the table information in the content. What am I missing?
Thanks for your help
Answer:
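The table does not appear to be part of the page's initial HTML, which would explain why it never shows up in the parsed response. The code below requests it from the site's RPCHandler.axd endpoint instead and loads the returned HTML into a pandas DataFrame: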
import requests
import pandas as pd
from io import StringIO
url = "https://www.nasdaqtrader.com/RPCHandler.axd"
headers = {
    "Referer": "https://www.nasdaqtrader.com/trader.aspx?id=TradeHalts",
}
payload = {
    "id": 2,
    "method": "BL_TradeHalt.GetTradeHalts",
    "params": "[]",
    "version": "1.1",
}
data = requests.post(url, json=payload, headers=headers).json()
data = StringIO(data["result"])
df = pd.read_html(data)[0]
print(df.head(10).to_markdown(index=False))
Prints:
| Halt Date | Halt Time | Issue Symbol | Issue Name | Market | Reason Codes | Pause Threshold Price | Resumption Date | Resumption Quote Time | Resumption Trade Time |
|---|---|---|---|---|---|---|---|---|---|
| 07/06/2022 | 15:57:38 | COMSP | 9.25% Srs A Cmltv Redm Prf Stk | NASDAQ | LUDP | nan | 07/06/2022 | 15:57:38 | nan |
| 07/06/2022 | 12:51:35 | BRPMU | B. Riley Principal 150 Merg Ut | NASDAQ | LUDP | nan | 07/06/2022 | 12:51:35 | 12:56:35 |
| 07/06/2022 | 12:06:06 | VACC | Vaccitech plc ADS | NASDAQ | LUDP | nan | 07/06/2022 | 12:06:06 | 12:16:06 |
| 07/06/2022 | 11:15:10 | USEA | United Maritime Corp Cm St | NASDAQ | LUDP | nan | 07/06/2022 | 11:15:10 | 11:29:25 |
| 07/06/2022 | 10:28:53 | USEA | United Maritime Corp Cm St | NASDAQ | LUDP | nan | 07/06/2022 | 10:28:53 | 10:43:30 |
| 07/06/2022 | 10:18:19 | USEA | United Maritime Corp Cm St | NASDAQ | LUDP | nan | 07/06/2022 | 10:18:19 | 10:28:19 |
| 07/06/2022 | 09:41:43 | GAMB | Gambling.com Group Os | NASDAQ | LUDP | nan | 07/06/2022 | 09:41:43 | 09:46:43 |
| 07/06/2022 | 09:37:16 | USEA | United Maritime Corp Cm St | NASDAQ | LUDP | nan | 07/06/2022 | 09:37:16 | 10:17:41 |
| 07/06/2022 | 09:31:15 | JJN | iPathA Series B Bloomberg Nickel Subindex Total Return ETN | NYSE Arca | M | nan | 07/06/2022 | 09:36:15 | 09:36:15 |
| 07/06/2022 | 09:31:17 | AMTI | Applied Molecular Transport Cm | NASDAQ | LUDP | nan | 07/06/2022 | 09:31:17 | 09:36:17 |
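
If pandas or an HTML parser backend such as lxml is not available, the same response can probably be handled with the standard library alone. The following is a minimal sketch, not part of the answer above: it assumes the response body is JSON whose "result" field holds an HTML table with a header row, which is what the pandas version relies on. The names TableRows and halts_from_response and the sample body are hypothetical, and the sample only mimics two of the rows printed above.

```python
import json
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def halts_from_response(body: str) -> list[dict]:
    """Turn a JSON response body into one dict per table row."""
    reply = json.loads(body)
    html = reply.get("result")
    if not html:
        # Assumption: a failed call carries no usable "result" field.
        raise ValueError(f"no 'result' in response: {reply.get('error')!r}")
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    # Assumption: the first row holds the column names.
    header, *records = parser.rows
    return [dict(zip(header, record)) for record in records]


# Hypothetical response body, shaped like the output printed above.
sample = json.dumps({
    "result": (
        "<table>"
        "<tr><th>Halt Date</th><th>Halt Time</th><th>Issue Symbol</th></tr>"
        "<tr><td>07/06/2022</td><td>12:06:06</td><td>VACC</td></tr>"
        "<tr><td>07/06/2022</td><td>09:41:43</td><td>GAMB</td></tr>"
        "</table>"
    )
})
halts = halts_from_response(sample)
print(halts[0]["Issue Symbol"])
```

With a live request, the text of the response from requests.post(...) would take the place of sample. Raising on a missing "result" makes a changed or failed endpoint show up as an error rather than as an empty table.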