Here's link for scraping : https://stockanalysis.com/stocks/
I'm trying to get all the rows of the table (6000 rows), but I only get the first 500 results. I guess it has to do with the condition of how many rows to display.
I tried almost everything I can. I'm , ALSO, a beginner in web scraping.
My code :
# Importing libraries
import numpy as np # numerical computing library
import pandas as pd # panel data library
import requests # http requests library
from bs4 import BeautifulSoup
url = 'https://stockanalysis.com/stocks/'
headers = {'User-Agent': ' user agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36'}
r = requests.get(url, headers)
soup = BeautifulSoup(r.text, 'html')
league_table = soup.find('table', class_ = 'symbol-table index')
col_df = ['Symbol', 'Company_name', 'Industry', 'Market_Cap']
for team in league_table.find_all('tbody'):
# i = 1
rows = team.find_all('tr')
df = pd.DataFrame(np.zeros([len(rows), len(col_df)]))
df.columns = col_df
for i, row in enumerate(rows):
s_symbol = row.find_all('td')[0].text
s_company_name = row.find_all('td')[1].text
s_industry = row.find_all('td')[2].text
s_market_cap = row.find_all('td')[3].text
df.iloc[i] = [s_symbol, s_company_name, s_industry, s_market_cap]
len(df) # should > 6000
What should I do?
CodePudding user response:
Take a look down the bottom of the html and you will see this
<script id="__NEXT_DATA__" type="application/json">
Try using bs4 to find this tag and load the data from inside it, I think this is everything you need.
CodePudding user response:
As stated, it's in the <script>
tags. Pull it and read it in.
import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd
url = 'https://stockanalysis.com/stocks/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = str(soup.find('script', {'id':'__NEXT_DATA__'}))
jsonStr = re.search('({.*})', jsonStr).group(0)
jsonData = json.loads(jsonStr)
df = pd.DataFrame(jsonData['props']['pageProps']['stocks'])
Output:
print(df)
s ... i
0 A ... Life Sciences Tools & Services
1 AA ... Metals & Mining
2 AAC ... Blank Check / SPAC
3 AACG ... Diversified Consumer Services
4 AACI ... Blank Check / SPAC
... ... ...
6033 ZWS ... Utilities-Regulated Water
6034 ZY ... Chemicals
6035 ZYME ... Biotechnology
6036 ZYNE ... Pharmaceuticals
6037 ZYXI ... Health Care Equipment & Supplies
[6038 rows x 4 columns]