I am trying to scrape the following website:
www.londonstockexchange.com/news-article/THRG/net-asset-value-s/15242427
Basically I just want to save the text, i.e. the following:
"The unaudited net asset values for BlackRock Throgmorton Trust PLC at close of business on 7 December 2021 were: 938.74p Capital only 947.82p Including current year income"
I've tried the following code, but I can't seem to parse the element. Any idea why?
import requests
from bs4 import BeautifulSoup

url = "https://www.londonstockexchange.com/news-article/THRG/net-asset-value-s/15242427"
page = requests.get(url)  # Request the page from the server
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find_all('div', attrs={'class': 'news-body-content'})
table
I've tried various approaches without luck. I hope someone can help.
CodePudding user response:
This is because that particular text is dynamically added to the page via JavaScript, whereas in your code you are simply making a request to the server and then getting the data returned (that is, the page source).
The client (i.e. your browser) is what runs the JavaScript code. BeautifulSoup on its own does not have that capability, so you will need to use a different library to render the JavaScript content first, and then use BeautifulSoup to parse the result.
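As a rough sketch of that two-step approach, here is one way it might look with Selenium driving headless Chrome. Selenium is my choice for illustration, not something the page requires, and the helper names `render_page` and `extract_news_text` are made up. The class name `news-body-content` is taken from the question, and I am assuming it is present once the page has rendered.

```python
from bs4 import BeautifulSoup

URL = "https://www.londonstockexchange.com/news-article/THRG/net-asset-value-s/15242427"
CLASS_NAME = "news-body-content"  # from the question; assumed to exist after rendering


def render_page(url, timeout=15):
    """Load the page in headless Chrome and return the HTML after JavaScript has run."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the dynamically added element is in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, CLASS_NAME))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_news_text(html):
    """Return the text of every div with the target class in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", class_=CLASS_NAME)
    return [div.get_text(" ", strip=True) for div in divs]
```

Calling `extract_news_text(render_page(URL))` should then give a list with the article text, provided Chrome and a matching driver are available on your machine.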
CodePudding user response:
I agree with all the other posters here; in order to properly scrape this, you should use a different library that executes JavaScript.
However, if you are a glutton for pain and can only use BeautifulSoup, the information you are looking for is accessible.
There is a script tag at the bottom of the page, and you can use soup.find(id='ng-lseg-state') to access it. It will be a mess of a string, but the information is in there:
\nNET ASSET VALUE\n\nBLACKROCK THROGMORTON TRUST PLC\n5493003B7ETS1JEDPF59\n\nThe unaudited net asset values for BlackRock Throgmorton Trust PLC at close of \nbusiness on 7 December 2021 were: \n\n938.74p Capital only\n947.82p Including current year income\n
Super ugly, I know, but it is there. You should probably just do what @hsac suggests and use a different library.
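If you do want to go the BeautifulSoup-only route, a minimal sketch might look like the following. The helper names `state_text` and `find_nav_statement` are hypothetical, and I am assuming the script contents contain literal `\n` sequences as in the excerpt above. The regex is tailored to this one announcement, so treat it as a starting point rather than a general parser.

```python
import re

from bs4 import BeautifulSoup


def state_text(html):
    """Return the contents of the ng-lseg-state script tag as one flat line of text."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(id="ng-lseg-state")
    if tag is None:
        return None
    raw = tag.string or ""
    # Crude cleanup: turn literal backslash-n sequences into spaces,
    # then collapse any run of whitespace into a single space.
    text = raw.replace("\\n", " ")
    return re.sub(r"\s+", " ", text)


def find_nav_statement(text):
    """Pull out the net asset value sentence, or None if it is not found."""
    match = re.search(r"The unaudited net asset values.*?current year income", text)
    return match.group(0) if match else None
```

You would feed it the response from the question's code, e.g. `find_nav_statement(state_text(page.content))`, after checking that `state_text` did not return `None`.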