I am trying to create a dataframe from a dict with different key names.
Here is a MWE:
# loads price data
from yahoofinancials import YahooFinancials
yahoo_tickers = ['SMT.L', 'MSFT', 'NIO']
yahoo_financials = YahooFinancials(yahoo_tickers)
data = yahoo_financials.get_historical_price_data(start_date='1970-01-30',
end_date='2022-12-30',
time_interval='daily')
This prints a dictionary with the three keys (here the name of the yahoo tickers) with information about each stock. This is stored as another dictionary within a dictionary.
I tried to clean up this dict
via loops, but end up being quite a slow process, especially with many more stocks.
Can someone suggest something that I can quickly convert the data
dict
in a dataframe
that looks like this:
My expected result:
Out[11]:
high low open ... adjclose yahoo_ticker instrumentType
Date ...
1968-12-31 4.852 4.852 4.852 ... 1.762543 SMT.L EQUITY
1969-01-01 4.852 4.852 4.852 ... 1.762543 SMT.L EQUITY
1969-01-02 4.852 4.852 4.852 ... 1.762543 SMT.L EQUITY
1969-01-03 4.852 4.852 4.852 ... 1.762543 SMT.L EQUITY
1969-01-06 4.852 4.852 4.852 ... 1.762543 SMT.L EQUITY
... ... ... ... ... ... ...
2022-12-23 11.220 10.690 11.220 ... 10.970000 NIO EQUITY
2022-12-27 10.610 9.970 10.530 ... 10.060000 NIO EQUITY
2022-12-28 10.250 9.610 10.010 ... 9.800000 NIO EQUITY
2022-12-29 10.270 9.770 9.920 ... 9.990000 NIO EQUITY
2022-12-30 9.980 9.520 9.830 ... 9.750000 NIO EQUITY
CodePudding user response:
Using json_normalize(). Since their isn't just 1 main key you need to normalize with a list comp and concat the results. The nested column "prices" will be used in the record_path to flatten the data. Using method chaining to clean up and format df.
import pandas as pd
from yahoofinancials import YahooFinancials
yahoo_tickers = ["SMT.L", "MSFT", "NIO"]
yahoo_financials = YahooFinancials(yahoo_tickers)
data = yahoo_financials.get_historical_price_data(
start_date="1970-01-30",
end_date="2022-12-30",
time_interval="daily"
)
df = (
pd
.concat([pd.json_normalize(data=data.get(x), record_path=["prices"], meta=["instrumentType"]) for x in data], keys=data.keys())
.droplevel(level=1)
.reset_index(names="yahoo_ticker")
.drop(columns="date")
.set_index("formatted_date")
.rename(columns={"formatted_date": "date"})
)
print(df)
Output:
yahoo_ticker high low ... volume adjclose instrumentType
formatted_date ...
1970-01-30 SMT.L 3.852 3.852 ... 0 1.399282 EQUITY
1970-02-02 SMT.L 3.852 3.852 ... 0 1.399282 EQUITY
1970-02-03 SMT.L 3.852 3.852 ... 0 1.399282 EQUITY
1970-02-04 SMT.L 3.852 3.852 ... 0 1.399282 EQUITY
1970-02-05 SMT.L 3.852 3.852 ... 0 1.399282 EQUITY
... ... ... ... ... ... ... ...
2022-12-22 NIO 11.580 10.760 ... 32468900 11.290000 EQUITY
2022-12-23 NIO 11.220 10.690 ... 33610000 10.970000 EQUITY
2022-12-27 NIO 10.610 9.970 ... 54165700 10.060000 EQUITY
2022-12-28 NIO 10.250 9.610 ... 42225300 9.800000 EQUITY
2022-12-29 NIO 10.270 9.770 ... 49380200 9.990000 EQUITY
CodePudding user response:
I would suggest using list compression to flatten the nested dictionary also it will be much more efficient than using pandas normalization methods.
pd.DataFrame([
{
**p,
'yahooTicker': k,
'instrumentType': v['instrumentType']
} for k, v in data.items() for p in v['prices']
]).drop(columns=['date']).set_index('formatted_date')
Result
high low open close volume adjclose yahooTicker instrumentType
formatted_date
1970-01-30 3.852 3.852 3.852 3.852 0 1.399282 SMT.L EQUITY
1970-02-02 3.852 3.852 3.852 3.852 0 1.399282 SMT.L EQUITY
1970-02-03 3.852 3.852 3.852 3.852 0 1.399282 SMT.L EQUITY
1970-02-04 3.852 3.852 3.852 3.852 0 1.399282 SMT.L EQUITY
1970-02-05 3.852 3.852 3.852 3.852 0 1.399282 SMT.L EQUITY
...