Home > Software design >  scrapin historical data from investing.com
scrapin historical data from investing.com

Time:09-28

I want to scrape historical data of financial reports. but I have no idea how to do it.

I tried googling historical values but I came out empty.

I want to extract from all the reports so for the sake of the example if someone can help me scrape from this address https://www.investing.com/economic-calendar/initial-jobless-claims-294 i would really appreciate it.

i will manually change the address to the diffrent reports i need. i just want to do an excel sheet that will follow on all of those reports.

Thank you!

CodePudding user response:

As a challenge, here is a solution to your conundrum. Bear in mind this will only work for the page you confirmed (jobless claims report), however it should give you a headstart. As you can see, that page has a 'Show more' button, under that table. If you open Dev tools - Network tab, you will see that every time you click that button, a post request is being made to an API endpoint, with a payload, and the response is a json object, with some html in it. You need to scrape that API endpoint. Here is a way to do it:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from datetime import datetime, timedelta
from tqdm import tqdm ## if using Jupyter notebook, import it as: from tqdm.notebook import tqdm

big_list = []
headers = {
    'content-type': 'application/x-www-form-urlencoded',
    'x-requested-with': 'XMLHttpRequest',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
    }

big_df = pd.DataFrame()
starting_date = '2022-09-29'
DD = timedelta(days=42)

def tqdm_generator():
  while True:
    yield

for _ in tqdm(tqdm_generator()):
    payload = f'eventID=1&event_attr_ID=294&event_timestamp={starting_date}'

    r = requests.post('https://www.investing.com/economic-calendar/more-history', data=payload, headers=headers)
    if r.json()['hasMoreHistory'] == '1':
        df = pd.read_html('<table>'   r.json()['historyRows']   '</table>')[0]
        big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
        starting_date = str(datetime.strptime(starting_date, '%Y-%m-%d') - DD).split()[0]
    else:
        break
big_df = big_df.drop_duplicates()
big_df.columns = ['Release Date', 'Time', 'Actual', 'Forecast','Previous', 'Revised']
print(big_df)
big_df.to_csv('jobless_claims_us.csv')

This will save that data in a csv file, and display the dataframe in terminal:

455/? [03:26<00:00, 2.18it/s]
Release Date    Time    Actual  Forecast    Previous    Revised
0   Sep 22, 2022    08:30   213K    218K    208K    NaN
1   Sep 15, 2022    08:30   213K    226K    218K    NaN
2   Sep 08, 2022    08:30   222K    240K    228K    NaN
3   Sep 01, 2022    08:30   232K    248K    237K    NaN
4   Aug 25, 2022    08:30   243K    253K    245K    NaN
... ... ... ... ... ... ...
2711    Jun 01, 1970 (May)  05:00   305.90K NaN 331.59K NaN
2717    May 01, 1970 (Apr)  05:00   331.59K NaN 268.68K NaN
2722    Apr 01, 1970 (Mar)  05:00   268.68K NaN 258.50K NaN
2723    Mar 01, 1970 (Feb)  05:00   258.50K NaN 249.32K NaN
2729    Feb 01, 1970 (Jan)  05:00   249.32K NaN 221.09K NaN
1216 rows × 6 columns

Docs for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html

For pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

For TQDM: https://pypi.org/project/tqdm/

  • Related