I am beginning to learn web page parsing with BeautifulSoup in Python. I am trying to get the news items for a stock from www.tradingview.com, specifically from https://www.tradingview.com/symbols/NSE-TORNTPHARM/news/. From that page I am trying to get the href of every link with the class card-wSNJR2eq cardLink-wSNJR2eq.
This finds nothing. I used the following code:
for a in html.find_all('a', class_="card-wSNJR2eq cardLink-wSNJR2eq"):
    print("Found the URL:", a['href'])
Even listing every "a" tag on the page does not show the hrefs that contain the news headlines. I tried both of the following:
for a in html.find_all('a', href=True):
    print("Found the URL:", a['href'])
as well as
html = BeautifulSoup(response, "html.parser")
topa = html.find_all('a')
Neither snippet lists the "a" tags whose hrefs contain the headlines. All the other "a" tags are listed.
Please help me understand what I am missing.
CodePudding user response:
Those links are loaded dynamically by JavaScript after the page loads, so they are not in the HTML you downloaded and BeautifulSoup never sees them. If you inspect the Network tab in your browser's dev tools, you will notice an XHR call to an API that returns a JSON response. The following code requests that API directly and gets the information that JavaScript loads into the page:
import requests
import pandas as pd
# headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
headers = {
    'authority': 'news-headlines.tradingview.com',
    'method': 'GET',
    'path': '/headlines/?category=stock&lang=en&symbol=NSE:TORNTPHARM',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'origin': 'https://www.tradingview.com',
    'referer': 'https://www.tradingview.com/',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36'
}
url='https://news-headlines.tradingview.com/headlines/?category=stock&lang=en&symbol=NSE:TORNTPHARM'
r = requests.get(url, headers=headers)
## dataframe
# df = pd.DataFrame(r.json())
# df
print(r.json()[0])
This will print out:
{'id': 'urn:newsml:mtnewswires.com:20220526:G2129513:0', 'title': "Torrent Pharmaceuticals to Acquire Four Brands from Dr. Reddy's Laboratories", 'sourceLogoId': 'mtnewswires', 'published': 1653620928, 'source': 'MT Newswires', 'urgency': 2, 'permission': 'headline', 'relatedSymbols': [{'symbol': 'NSE:DRREDDY', 'logoid': 'dr-reddys'}, {'symbol': 'NSE:TORNTPHARM', 'logoid': 'torrent-pharmaceuticals'}], 'astDescription': {'type': 'root', 'children': [{'type': 'p', 'children': ['Torrent Pharmaceuticals (', {'type': 'symbol', 'params': {'symbol': 'NSE:TORNTPHARM', 'text': 'NSE:TORNTPHARM'}}, ', BOM:500420) has agreed to acquire four pharmaceutical brands from ']}, ' ', {'type': 'p', 'children': ["Dr. Reddy's Laboratories (", {'type': 'symbol', 'params': {'symbol': 'NSE:DRREDDY', 'text': 'NSE:DRREDDY'}}, ', BOM:500124) for an undisclosed amount. ']}, {'type': 'p', 'children': ['The four brands include gynecology product Styptovit-E, as well as benign prostatic hyperplasia treatments Finast, Finast-T, and Dynapress, according to a Thursday night filing. ']}, {'type': 'p', 'children': ['According to the terms of the deal, Torrent Pharma will take over the manufacturing, marketing, and distribution of the brands in India. ']}, {'type': 'p', 'children': ['The transaction and integration of brands are expected to be completed by June.']}]}, 'summary': None}
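The 'astDescription' field in the output above looks like a small tree of nodes rather than a plain string. As a rough sketch, assuming every node is either a string or a dict with a 'type' and optional 'children' (which is only what this one sample shows, not a documented format), it could be flattened to text as below. The helper name ast_to_text is hypothetical, and the sample is a shortened copy of the printed item so the snippet runs without a network call.

```python
def ast_to_text(node):
    """Hypothetical helper: flatten an astDescription node into plain text."""
    if isinstance(node, str):
        return node
    if not isinstance(node, dict):
        # Covers None (e.g. a missing description) and unexpected values.
        return ""
    if node.get("type") == "symbol":
        # Assumption: symbol nodes carry their display text in params['text'].
        return node.get("params", {}).get("text", "")
    text = "".join(ast_to_text(child) for child in node.get("children", []))
    # Assumption: 'p' nodes are paragraphs, so end each one with a newline.
    return text + "\n" if node.get("type") == "p" else text


# Shortened copy of the 'astDescription' value printed above.
sample_ast = {
    "type": "root",
    "children": [
        {"type": "p", "children": [
            "Torrent Pharmaceuticals (",
            {"type": "symbol", "params": {"symbol": "NSE:TORNTPHARM", "text": "NSE:TORNTPHARM"}},
            ", BOM:500420) has agreed to acquire four pharmaceutical brands from ",
        ]},
        " ",
        {"type": "p", "children": [
            "Dr. Reddy's Laboratories (",
            {"type": "symbol", "params": {"symbol": "NSE:DRREDDY", "text": "NSE:DRREDDY"}},
            ", BOM:500124) for an undisclosed amount. ",
        ]},
    ],
}

print(ast_to_text(sample_ast))
```

With a live response, the same helper would be called as ast_to_text(item.get('astDescription')) for each item in r.json().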
You can inspect the JSON object further to get the data you need.
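For instance, here is a minimal sketch of pulling a few fields out of each item. It assumes the response is a list of dicts shaped like the one printed above and that 'published' is a Unix timestamp in seconds (it looks like one, but that is a guess from the sample). The helper name summarize is hypothetical, and the sample list stands in for r.json() so the snippet runs offline.

```python
from datetime import datetime, timezone

# Trimmed copy of the item printed above; a real run would use r.json().
sample_items = [
    {
        "id": "urn:newsml:mtnewswires.com:20220526:G2129513:0",
        "title": "Torrent Pharmaceuticals to Acquire Four Brands from Dr. Reddy's Laboratories",
        "published": 1653620928,
        "source": "MT Newswires",
        "relatedSymbols": [
            {"symbol": "NSE:DRREDDY", "logoid": "dr-reddys"},
            {"symbol": "NSE:TORNTPHARM", "logoid": "torrent-pharmaceuticals"},
        ],
    }
]


def summarize(items):
    """Hypothetical helper: keep a few fields from each headline dict."""
    rows = []
    for item in items:
        published = item.get("published")
        rows.append({
            "title": item.get("title"),
            "source": item.get("source"),
            # Assumption: 'published' is seconds since the Unix epoch.
            "published_utc": (
                datetime.fromtimestamp(published, tz=timezone.utc).isoformat()
                if published is not None else None
            ),
            "symbols": [s.get("symbol") for s in item.get("relatedSymbols") or []],
        })
    return rows


for row in summarize(sample_items):
    print(row["published_utc"], row["source"], row["title"])
```

Using .get() throughout keeps the loop from failing if an item lacks one of these keys, which seems prudent since the API's schema is not documented here.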