Web scrape using Python - Execution takes too long


I am trying to web-scrape the "Active Positions" table from the following website:

https://www.nasdaq.com/market-activity/stocks/aapl/institutional-holdings

My code is below:

from bs4 import BeautifulSoup
import requests

html_text = requests.get('https://www.nasdaq.com/market-activity/stocks/aapl/institutional-holdings').text
soup = BeautifulSoup(html_text, 'lxml')
job1 = soup.find('div', class_ = 'dialog-off-canvas-main-canvas')
job2 = job1.find('div', class_ = 'page with-primary-nav hide-more-videos')
job3 = job2.find('div', class_ = 'page__main')
job4 = job3.find('div', class_ = 'page__content')
job5 = job4.find('div', class_ = 'quote-subdetail__content quote-subdetail__content--new')
job6 = job5.find('div', class_ = 'layout layout--2-col-large')
job7 = job6.find('div', class_ = 'institutional-holdings institutional-holdings--paginated')
job8 = job7.find('div', class_ = 'institutional-holdings__section institutional-holdings__section--active-positions')
job9 = job8.find('div', class_ = 'institutional-holdings__table-container')
job10 = job9.find('table', class_ = 'institutional-holdings__table')
job11 = job10.find('tbody', class_ = 'institutional-holdings__body')
job12 = job11.findAll('tr', class_ = 'institutional-holdings__row')

for row in job12:
    print(row.text)

I have chosen to include nearly every class in the path in an attempt to speed up execution, because including only a couple ran for up to 10 minutes before I decided to interrupt it. However, I still get the same long execution with no output. Is there something wrong with my code? Or can I improve this by doing something I haven't thought of? Thanks.

CodePudding user response:

The data is hydrated into the page via JavaScript XHR calls, so the table is not present in the raw HTML that requests downloads. Here is a way of getting the Active Positions by scraping the API endpoint directly:

import requests
import pandas as pd

url = 'https://api.nasdaq.com/api/company/AAPL/institutional-holdings?limit=15&type=TOTAL&sortColumn=marketValue&sortOrder=DESC'

headers = {
    'accept': 'application/json, text/plain, */*',
    'origin': 'https://www.nasdaq.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

r = requests.get(url, headers=headers)
# the JSON payload nests the Active Positions table under data -> activePositions -> rows
df = pd.json_normalize(r.json()['data']['activePositions']['rows'])
print(df)

Result in terminal:

                    positions  holders         shares
0         Increased Positions    1,780    239,170,203
1         Decreased Positions    2,339    209,017,331
2              Held Positions      283  8,965,339,255
3  Total Institutional Shares    4,402  9,413,526,789
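
If you need the numbers as integers rather than comma-formatted strings (the output above shows them with thousands separators, which suggests the API returns strings), a small follow-up; the column names are taken from the printed result:

# assumes 'holders' and 'shares' arrive as comma-formatted strings, as in the output above
for col in ['holders', 'shares']:
    df[col] = df[col].str.replace(',', '', regex=False).astype('int64')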

In case you want to scrape the big table of 4,402 Institutional Holders, there are ways to do that too; see the sketch below.
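
For example, here is a minimal sketch reusing the same endpoint. Treat it as a starting point, not a confirmed recipe: the higher limit value and the holdingsTransactions key nesting are assumptions to verify against the actual payload.

# raising 'limit' in the query string is an assumption about the API; 15 was the original value
holders_url = ('https://api.nasdaq.com/api/company/AAPL/institutional-holdings'
               '?limit=100&type=TOTAL&sortColumn=marketValue&sortOrder=DESC')
resp = requests.get(holders_url, headers=headers)
payload = resp.json()['data']
print(payload.keys())  # inspect which sections the response actually contains

# 'holdingsTransactions' and its nesting are assumptions -- adjust to the keys printed above
holders = pd.json_normalize(payload['holdingsTransactions']['table']['rows'])
print(holders)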

EDIT: Here is how you can save the data to a JSON file:

df.to_json('active_positions.json')

Although it might make more sense to save it as tabular data (CSV):

df.to_csv('active_positions.csv')
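
By default pandas also writes the DataFrame index as a leading column; pass index=False to omit it:

df.to_csv('active_positions.csv', index=False)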

Pandas docs: https://pandas.pydata.org/docs/
