I'm trying to make a web-scraping script, but I'm bumping into an error and can't seem to figure out why. I'm using the Spyder IDE, so all the variables are shown in the Variable Explorer. My code is as follows:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

root = "https://finviz.com/quote.ashx?t="
tickers = ['AMZN', 'GS']
news_tables = {}

for ticker in tickers:
    url = root + ticker
    req = Request(url=url, headers={'user-agent': 'dirty-30'})
    response = urlopen(req)
    # print(response)
    html = BeautifulSoup(response, 'html')
    # print(html)
    news_table = html.find(id='news-table')
    news_tables[ticker] = news_table

amzn_data = news_tables['AMZN']
amzn_rows = amzn_data.findALL('tr')
print(news_tables)
I get back this error:
TypeError: 'NoneType' object is not callable
Exception in comms call get_value:
File "C:\Users\austi\Anaconda3\lib\site-packages\spyder_kernels\comms\commbase.py", line 347, in _handle_remote_call
self._set_call_return_value(msg_dict, return_value)
File "C:\Users\austi\Anaconda3\lib\site-packages\spyder_kernels\comms\commbase.py", line 384, in _set_call_return_value
self._send_message('remote_call_reply', content=content, data=data,
File "C:\Users\austi\Anaconda3\lib\site-packages\spyder_kernels\comms\frontendcomm.py", line 109, in _send_message
return super(FrontendComm, self)._send_message(*args, **kwargs)
File "C:\Users\austi\Anaconda3\lib\site-packages\spyder_kernels\comms\commbase.py", line 247, in _send_message
buffers = [cloudpickle.dumps(
File "C:\Users\austi\Anaconda3\lib\site-packages\cloudpickle\cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "C:\Users\austi\Anaconda3\lib\site-packages\cloudpickle\cloudpickle_fast.py", line 609, in dump
raise pickle.PicklingError(msg) from e
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.
I tried adding
sys.setrecursionlimit(30000000)
When I attempt to open news_tables (which is of type dict) in the Variable Explorer, I get a stack overflow message and the kernel restarts. What am I missing here? I think the NoneType error stems from the stack overflow breaking the variable, so it just deletes the dict, leaving an empty dict, a.k.a. NoneType...
Why am I getting a stack overflow? This shouldn't be that much data; it's one scrape of one page per ticker. If my understanding of stack overflow is correct, then all I can think of is that there is somehow an infinite loop gathering the same data until I hit the pickling error? I have several TB of storage on my system and tons of RAM.
I'm perplexed; any insights? I've restarted Anaconda as a whole, and my Spyder is up to date.
Thanks.
CodePudding user response:
import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}


def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        allin = []
        for t in ['AMZN', 'GS']:
            params = {
                't': t
            }
            r = req.get(url, params=params)
            df = pd.read_html(r.content, attrs={'id': 'news-table'})[0]
            allin.append(df)
        df = pd.concat(allin, ignore_index=True)
        print(df)
        # df.to_csv('data.csv', index=False)


main('https://finviz.com/quote.ashx')
Output:
0 1
0 Dec-10-22 01:30PM Selling Your Home During the Holidays? 4 Moves...
1 12:21PM 15 Most Trusted Companies in the World Insider...
2 10:30AM Target, Amazon and 4 More Retailers That Will ...
3 10:30AM Opinion: These Will Be the 2 Largest Stocks by...
4 08:37AM Better Buy: Microsoft vs. Amazon Motley Fool
.. ... ...
195 11:49AM Goldman Sachs, Eager to Grow Cards Business, C...
196 10:53AM Oil Slips as Swelling China Covid Cases Outwei...
197 08:42AM Oil Dips Near $98 as Swelling China Covid Case...
198 08:41AM Morgan Stanley funds have billions riding on a...
199 06:30AM 3 Goldman Sachs Mutual Funds Worth Betting On ...
[200 rows x 2 columns]
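A couple of optional touches: read_html returns the cells as unlabeled columns 0 and 1, and pd.concat loses track of which ticker each frame came from, so you could label both inside the loop right after the read_html call. The column and ticker names below are purely illustrative:

df.columns = ['time', 'headline']  # illustrative labels for the two unnamed columns
df['ticker'] = t                   # remember which symbol these rows came from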
Or
import requests
from bs4 import BeautifulSoup, SoupStrainer
from pprint import pp

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}


def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        allin = {}
        for t in ['AMZN', 'GS']:
            params = {
                't': t
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.content, 'lxml', parse_only=SoupStrainer(
                'table', attrs={'id': 'news-table'}))
            allin[t] = soup
        pp(allin)


main('https://finviz.com/quote.ashx')
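If you want the individual rows rather than the pretty-printed soup objects, you could swap the pp(allin) call for a loop over the strained tables. A minimal sketch, assuming each <tr> in news-table still holds a timestamp cell followed by a headline cell:

# in place of pp(allin), inside main():
for ticker, table in allin.items():
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) >= 2:  # skip any rows that don't have both cells
            timestamp = cells[0].get_text(strip=True)
            headline = cells[1].get_text(strip=True)
            print(ticker, timestamp, headline)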
CodePudding user response:
Just replace
[...]
html = BeautifulSoup(response, 'html')
[...]
amzn_rows = amzn_data.findALL('tr')
with
[...]
html = BeautifulSoup(response, 'html.parser')
[...]
amzn_rows = amzn_data.find_all('tr')
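Applying both replacements to the original script (and writing the URL concatenation out as root + ticker), a corrected version would look roughly like this:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

root = "https://finviz.com/quote.ashx?t="
tickers = ['AMZN', 'GS']
news_tables = {}

for ticker in tickers:
    url = root + ticker
    req = Request(url=url, headers={'user-agent': 'dirty-30'})
    response = urlopen(req)
    # name the parser explicitly instead of passing 'html'
    html = BeautifulSoup(response, 'html.parser')
    news_table = html.find(id='news-table')
    news_tables[ticker] = news_table

amzn_data = news_tables['AMZN']
# find_all is the correct method name; the misspelled findALL is treated by
# BeautifulSoup as a tag lookup that returns None, which is the likely source
# of the "'NoneType' object is not callable" TypeError
amzn_rows = amzn_data.find_all('tr')
print(news_tables)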