How to scrape all rows from a dynamic table in HTML with multiple displays using Python

Here's the link I'm scraping: https://stockanalysis.com/stocks/

I'm trying to get all the rows of the table (about 6000), but I only get the first 500. I guess it has to do with the setting that controls how many rows are displayed.

I've tried almost everything I can think of. I'm also a beginner in web scraping.

My code:

# Importing libraries
import pandas as pd # panel data library
import requests     # http requests library
from bs4 import BeautifulSoup


url = 'https://stockanalysis.com/stocks/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36'}
r = requests.get(url, headers=headers)  # headers must be passed by keyword
soup = BeautifulSoup(r.text, 'html.parser')
league_table = soup.find('table', class_ = 'symbol-table index')
col_df = ['Symbol', 'Company_name', 'Industry', 'Market_Cap']

# Collect the text of the first four cells of every body row
records = []
for team in league_table.find_all('tbody'):
    for row in team.find_all('tr'):
        cells = row.find_all('td')
        records.append([cell.text for cell in cells[:4]])

df = pd.DataFrame(records, columns=col_df)

len(df) # expected > 6000, but I get 500

What should I do?

CodePudding user response：

Look near the bottom of the page's HTML and you will see this tag:

<script id="__NEXT_DATA__" type="application/json">

Try using bs4 to find this tag and load the JSON inside it. I think it contains everything you need.
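A minimal sketch of that idea, run against a small inline HTML sample instead of the live site so it works offline. The sample payload and the helper name `extract_next_data` are made up for illustration; the real page's JSON layout may differ, so print the keys and look around before relying on any particular path.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source; the real payload is much larger.
SAMPLE_HTML = """
<html><body>
<table><tr><td>A</td></tr></table>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"stocks": [{"s": "A"}, {"s": "AA"}]}}}
</script>
</body></html>
"""

def extract_next_data(html):
    """Return the parsed JSON from the __NEXT_DATA__ script tag, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None or not tag.string:
        return None
    return json.loads(tag.string)

data = extract_next_data(SAMPLE_HTML)
# Walk the keys level by level to find where the list of rows lives.
print(list(data.keys()))
print(list(data["props"]["pageProps"].keys()))
```

Returning None when the tag is missing makes it easy to tell "the page layout changed" apart from "the JSON is there but shaped differently".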

CodePudding user response：

As stated, the data is in that <script> tag. Pull it out and read it in.

import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd

url = 'https://stockanalysis.com/stocks/'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
# Grab the script tag that holds the page data and turn it into a string
jsonStr = str(soup.find('script', {'id':'__NEXT_DATA__'}))

# Keep only the JSON object (from the first '{' to the last '}')
jsonStr = re.search(r'({.*})', jsonStr).group(0)
jsonData = json.loads(jsonStr)

# The list of stocks sits under props -> pageProps -> stocks
df = pd.DataFrame(jsonData['props']['pageProps']['stocks'])

Output:

print(df)
         s  ...                                 i
0        A  ...    Life Sciences Tools & Services
1       AA  ...                   Metals & Mining
2      AAC  ...                Blank Check / SPAC
3     AACG  ...     Diversified Consumer Services
4     AACI  ...                Blank Check / SPAC
   ...  ...                               ...
6033   ZWS  ...         Utilities-Regulated Water
6034    ZY  ...                         Chemicals
6035  ZYME  ...                     Biotechnology
6036  ZYNE  ...                   Pharmaceuticals
6037  ZYXI  ...  Health Care Equipment & Supplies

[6038 rows x 4 columns]
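The `...` column in that printout is just pandas truncating the display, and the column names are short keys rather than the labels used in the question. Below is a small sketch of listing and renaming the columns. It uses hand-written sample rows, and only the `s` and `i` keys are mapped because those are the only two visible in the output above; check `df.columns` on the real data for the remaining two.

```python
import pandas as pd

# Hypothetical rows shaped like the output above (only "s" and "i" are visible there).
rows = [
    {"s": "A", "i": "Life Sciences Tools & Services"},
    {"s": "AA", "i": "Metals & Mining"},
]
df = pd.DataFrame(rows)

# Show every column instead of collapsing the middle ones into "...".
pd.set_option("display.max_columns", None)
print(df.columns.tolist())

# Map the short keys to readable names; unmapped columns keep their names.
renamed = df.rename(columns={"s": "Symbol", "i": "Industry"})
print(renamed.head())
```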