I want to scrape historical data from financial reports, but I have no idea how to do it.
I tried googling for the historical values, but I came up empty.
I want to extract data from all the reports, so for the sake of the example, if someone can help me scrape this address, https://www.investing.com/economic-calendar/initial-jobless-claims-294, I would really appreciate it.
I will manually change the address to the different reports I need. I just want to build an Excel sheet that keeps track of all of those reports.
Thank you!
CodePudding user response:
As a challenge, here is a solution to your conundrum. Bear in mind this will only work for the page you confirmed (the jobless claims report); however, it should give you a head start.
As you can see, that page has a 'Show more' button below the table. If you open Dev tools - Network tab, you will see that every time you click that button, a POST request is made to an API endpoint with a form payload, and the response is a JSON object with some HTML in it. That API endpoint is what you need to scrape.
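You can verify this yourself with a single request before committing to a full loop. The sketch below reuses the same headers and payload that the Network tab shows for this page; the two response keys it prints ('historyRows' and 'hasMoreHistory') are the ones the full script relies on:
import requests

headers = {
    'content-type': 'application/x-www-form-urlencoded',
    'x-requested-with': 'XMLHttpRequest',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
# ask for history older than the given timestamp; 294 is the number at the end of the page URL
payload = 'eventID=1&event_attr_ID=294&event_timestamp=2022-09-29'
r = requests.post('https://www.investing.com/economic-calendar/more-history', data=payload, headers=headers)
print(r.json().keys())  # expect at least 'historyRows' (HTML table rows) and 'hasMoreHistory'
Once that works, here is the full scraper: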
import requests
import pandas as pd
from io import StringIO
from datetime import datetime, timedelta
from tqdm import tqdm  # if using a Jupyter notebook, import it as: from tqdm.notebook import tqdm

headers = {
    'content-type': 'application/x-www-form-urlencoded',
    'x-requested-with': 'XMLHttpRequest',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}

big_df = pd.DataFrame()
starting_date = '2022-09-29'
DD = timedelta(days=42)  # step the timestamp back six weeks per request

def tqdm_generator():
    # infinite generator, so tqdm can track a loop whose length we don't know upfront
    while True:
        yield

for _ in tqdm(tqdm_generator()):
    payload = f'eventID=1&event_attr_ID=294&event_timestamp={starting_date}'
    r = requests.post('https://www.investing.com/economic-calendar/more-history', data=payload, headers=headers)
    if r.json()['hasMoreHistory'] == '1':
        # the API returns bare table rows, so wrap them in a <table> for pandas to parse;
        # pd.read_html needs a parser backend installed (lxml, or html5lib + beautifulsoup4)
        df = pd.read_html(StringIO('<table>' + r.json()['historyRows'] + '</table>'))[0]
        big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
        starting_date = str(datetime.strptime(starting_date, '%Y-%m-%d') - DD).split()[0]
    else:
        break

big_df = big_df.drop_duplicates()
big_df.columns = ['Release Date', 'Time', 'Actual', 'Forecast', 'Previous', 'Revised']
print(big_df)
big_df.to_csv('jobless_claims_us.csv')
This will save the data to a CSV file and display the dataframe in the terminal:
455/? [03:26<00:00, 2.18it/s]
Release Date Time Actual Forecast Previous Revised
0 Sep 22, 2022 08:30 213K 218K 208K NaN
1 Sep 15, 2022 08:30 213K 226K 218K NaN
2 Sep 08, 2022 08:30 222K 240K 228K NaN
3 Sep 01, 2022 08:30 232K 248K 237K NaN
4 Aug 25, 2022 08:30 243K 253K 245K NaN
... ... ... ... ... ... ...
2711 Jun 01, 1970 (May) 05:00 305.90K NaN 331.59K NaN
2717 May 01, 1970 (Apr) 05:00 331.59K NaN 268.68K NaN
2722 Apr 01, 1970 (Mar) 05:00 268.68K NaN 258.50K NaN
2723 Mar 01, 1970 (Feb) 05:00 258.50K NaN 249.32K NaN
2729 Feb 01, 1970 (Jan) 05:00 249.32K NaN 221.09K NaN
1216 rows × 6 columns
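Since your end goal is an Excel sheet, note that the numeric columns come back as strings like '213K'. Here is a sketch of a post-processing step; it assumes every value in this report carries a 'K' (thousands) suffix, which holds for jobless claims but should be re-checked for other reports:
import pandas as pd

def k_to_float(value):
    # convert strings like '213K' (or '1,234K') to floats; leave anything else untouched
    if isinstance(value, str) and value.endswith('K'):
        return float(value[:-1].replace(',', '')) * 1000
    return value

big_df = pd.read_csv('jobless_claims_us.csv', index_col=0)
for col in ['Actual', 'Forecast', 'Previous', 'Revised']:
    big_df[col] = big_df[col].apply(k_to_float)
big_df.to_excel('jobless_claims_us.xlsx', index=False)  # to_excel requires openpyxl to be installed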
Docs for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
For pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
For TQDM: https://pypi.org/project/tqdm/
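Finally, since you said you will change the address manually for each report: the part of the URL that changes is the trailing number, and that same number goes into event_attr_ID in the payload. Below is a sketch that wraps the loop above in a function and writes each report to its own sheet of one workbook. I have only verified the endpoint against the jobless claims page, so treat the assumption that eventID stays 1 for other reports (and the placeholder ID in the dict) as things to double-check in the Network tab:
import requests
import pandas as pd
from io import StringIO
from datetime import datetime, timedelta

headers = {
    'content-type': 'application/x-www-form-urlencoded',
    'x-requested-with': 'XMLHttpRequest',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}

def scrape_history(event_id, start='2022-09-29'):
    # the same pagination loop as the main script, parameterised by event_attr_ID
    big_df = pd.DataFrame()
    date = start
    while True:
        payload = f'eventID=1&event_attr_ID={event_id}&event_timestamp={date}'
        r = requests.post('https://www.investing.com/economic-calendar/more-history',
                          data=payload, headers=headers)
        if r.json()['hasMoreHistory'] != '1':
            break
        df = pd.read_html(StringIO('<table>' + r.json()['historyRows'] + '</table>'))[0]
        big_df = pd.concat([big_df, df], ignore_index=True)
        date = str(datetime.strptime(date, '%Y-%m-%d') - timedelta(days=42)).split()[0]
    return big_df.drop_duplicates()

# keys are sheet names; IDs come from each report's URL
# (294 is real; the commented-out entry is a hypothetical placeholder)
reports = {
    'initial_jobless_claims': 294,
    # 'some_other_report': 123456,
}

with pd.ExcelWriter('economic_reports.xlsx') as writer:
    for name, event_id in reports.items():
        scrape_history(event_id).to_excel(writer, sheet_name=name, index=False)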