Unable to scrape data from website with Python

I want to extract the tables from "Bonds Traded on the Exchange" and "OTC trades" and save them to an Excel sheet. I am trying to scrape the data with Python (BeautifulSoup and requests), but I am unable to get it (I don't want to use Selenium). Can anyone guide me? I am not getting any error; the script just never finishes in the Python terminal. I think the terminal hangs, since I don't even get an error message.


import requests
import pandas as pd
import os
from bs4 import BeautifulSoup as bs



url = "https://www1.nseindia.com/products/content/debt/corp_bonds/cbm_reporting_homepage.htm"

#condition  True
#while condition:

html = requests.get(url).content
page= requests.get(url)
soup= bs(page.text, 'lxml')
df_list = pd.read_html(html)
df = df_list[0]     # can change 0 to other number 
print(df)

CodePudding user response：

If you look at the Network tab of your browser's developer tools, you will see a request for cbm_reporting_cbricsL.htm, which is the page you actually need to scrape. You should also add headers to the request for it to work properly (see the detailed explanation in this thread). For example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get(
    'https://www1.nseindia.com/products/dynaContent/debt/corp_bonds/htms/cbm_reporting_cbricsL.htm',
    headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
    )

soup = BeautifulSoup(res.text, 'lxml')

# Each row becomes a list of its <td> Tag objects (not plain text)
raw_columns = [row.find_all('td') for row in soup.find_all('tr')]

# The first 3 rows are dummy rows, so skip them
df = pd.DataFrame.from_records(raw_columns[3:])

The result would look like:

0   [INE001A07TA7]  [HOUSING DEVELOPMENT FINANCE CORPORATION LTD S...   [ 100.0030] [ 4.7082]   [ 16]   [[ 168000.00]]  [ 100.0000] [ 4.7091]
1   [INE134E07AP6]  [POWER FINANCE CORPORATION LTD. TRI SRV CATIII...   [ 100.8500] [ 6.6934]   [ 1]    [ 1000.00 ] [ 100.8500] [ 6.6934]
2   [INE020B08963]  [RURAL ELECTRIFICATION CORPORATION LIMITED SR-...   [ 107.6835] [ 5.9200]   [ 1]    [ 1500.00 ] [ 107.6835] [ 5.9200]
3   [INE163N08131]  [-] [ 104.2195] [ 6.6200]   [ 1]    [ 780.00 ]  [ 104.2195] [ 6.6200]
4   [INE540P07343]  [-] [ 104.3408] [ 9.3603]   [ 6]    [[ 1110.00]]    [ 104.2640] [ 9.3800]
... ... ... ... ... ... ... ... ...
93  [INE377Y07250]  [BAJAJ HOUSING FINANCE LIMITED SR 27 5.69 NCD ...   [ 100.0300] [ 5.6845]   [ 1]    [ 9000.00 ] [ 100.0300] [ 5.6845]
94  [INE115A07ML7]  [LIC HOUSING FINANCE LIMITED SRTR349OP-1 7.4NC...   [ 105.0991] [ 5.5000]   [ 1]    [ 1000.00 ] [ 105.0991] [ 5.5000]
95  [INE020B07HN3]  [RURAL ELECTRIFICATION CORPORATION LIMITED SR-...   [ 123.6000] [ 4.4400]   [ 1]    [ 10.00 ]   [ 123.6000] [ 4.4400]
96  [INE101A08070]  [MAHINDRA AND MAHINDRA LIMITED 9.55 NCD 04JL63...   [ 125.5000] [ 7.5218]   [ 1]    [ 820.00 ]  [ 125.5000] [ 7.5218]
97  [INE062A08215]  [STATE BANK OF INDIA SERIES I 8.75 BD PERPETUA...   [ 104.5304] [ 7.0000]   [ 1]    [ 10.00 ]   [ 104.5304] [ 7.0000]
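
The brackets in the output above presumably appear because each cell still holds a BeautifulSoup Tag rather than a plain string. A minimal sketch of pulling the text out of every cell first is shown below. The names `rows_as_text`, `skip_rows` and `sample_html` are made up for illustration, the sample markup is a stand-in for the real page, and `html.parser` is used instead of `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup


def rows_as_text(html, skip_rows=0):
    """Return each table row as a list of stripped cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # get_text(strip=True) also collects text nested inside child tags
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # ignore rows that contain no <td> cells at all
            rows.append(cells)
    return rows[skip_rows:]


# Hypothetical markup standing in for the real page
sample_html = """
<table>
  <tr><th>ISIN</th><th>Price</th></tr>
  <tr><td>INE001A07TA7</td><td> 100.0030 </td></tr>
  <tr><td>INE134E07AP6</td><td><b> 100.8500 </b></td></tr>
</table>
"""

rows = rows_as_text(sample_html)
print(rows)
```

On the real page you would pass `res.text` and build the frame with `pd.DataFrame(rows_as_text(res.text))`. Because rows without any `<td>` are dropped here, the number of dummy rows to skip may differ from the 3 used above, so check the first few rows before choosing `skip_rows`.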

CodePudding user response：

This is my final answer:


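from io import StringIO

import requests
import pandas as pd

# Send a browser-like User-Agent so the request works properly
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

response = requests.get(
    'https://www1.nseindia.com/products/dynaContent/debt/corp_bonds/htms/cbm_reporting_cbricsL.htm',
    headers=headers,
    timeout=30)  # raise an error instead of hanging if the server does not answer
response.raise_for_status()

# read_html returns one DataFrame per <table> found in the page
df_list = pd.read_html(StringIO(response.text))
df = df_list[0]
print(df)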
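
The question also asks for the tables to be saved to an Excel sheet, which none of the code above does yet. A minimal sketch follows, assuming an Excel engine such as `openpyxl` is installed alongside pandas. The helper `save_tables_to_excel`, the column names and the output file name are made up for illustration. The two small frames only reuse values from the output shown earlier as placeholders and do not imply which table a row belongs to.

```python
import os
import tempfile

import pandas as pd


def save_tables_to_excel(tables, path):
    """Write each DataFrame in `tables` (sheet name -> DataFrame) to its own sheet."""
    with pd.ExcelWriter(path) as writer:
        for sheet_name, frame in tables.items():
            # Excel limits sheet names to 31 characters
            frame.to_excel(writer, sheet_name=sheet_name[:31], index=False)


# Placeholder frames; the column names are assumptions, not taken from the site
exchange = pd.DataFrame({"ISIN": ["INE001A07TA7"], "Price": [100.0030]})
otc = pd.DataFrame({"ISIN": ["INE134E07AP6"], "Price": [100.8500]})

# Write to a temporary directory so the sketch does not clutter the working folder
out_path = os.path.join(tempfile.mkdtemp(), "bonds.xlsx")
save_tables_to_excel(
    {"Bonds Traded on the Exchange": exchange, "OTC trades": otc},
    out_path,
)

# Read the workbook back to confirm that both sheets were written
sheets = pd.read_excel(out_path, sheet_name=None)
print(list(sheets))
```

With the scraped data you would pass the relevant entries of `df_list` instead of the placeholder frames, for example `{"Bonds Traded on the Exchange": df_list[0]}`. Which index holds which table depends on the page, so inspect `df_list` first.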