I've been trying to build a 3D pandas dataframe of the following format:
Each 'panel' is for a specific stock and includes the open, close, low and high for that stock for many points in time Multiple panel are placed together to form the 3D dataframe (previously I think I would have called these panels, but it looks like panels have been depreciated?).
Here's what I'm currently trying:
for x in files:
dataloc = datalocation x
rawinputData.append(prepData(dataloc))
inputdata = pd.concat(rawinputData,keys=stockTicker)
Where preData looks like this:
def prepData(dataloc):
inputData = pd.read_csv(dataloc)
inputData.columns = ["OpenTime","Open","High","Low","Close"]
return inputData
I expected this this give me the type of dataframe where I can pull a slice through any given dimension (eg, all open prices for all stocks at all times; the open price for all stocks at time x; all prices for all times for stock x;) but when I run this type of operation I hit an error, I've only got 2 dimentions:
print(inputdata.shape)
returns
(314631, 4)
What am I doing wrong here?
CodePudding user response:
A pandas DataFrame is always two-dimensional in the sense that it has a row index and a column index, which span a table of values. However, you can represent higher-dimensional data as a DataFrame by making one (or both) of the indices a MultiIndex.
Here is an example with two "panels":
import pandas as pd
colnames = ['open', 'high', 'low', 'close']
dates = pd.Index(['2019-01-01', '2020-01-01', '2021-01-01'])
df1 = pd.DataFrame([[15, 20, 10, 11],
[21, 22, 20, 21],
[38, 40, 32, 37]],
index=dates, columns=colnames)
df2 = pd.DataFrame([[35, 40, 30, 35],
[45, 48, 43, 46],
[67, 70, 65, 66]],
index=dates, columns=colnames)
df = pd.concat([df1, df2], axis=1)
df.columns = pd.MultiIndex.from_product([['Daimler', 'Tesla'], colnames],
names=['stock', 'price'])
df
stock Daimler Tesla
price open high low close open high low close
2019-01-01 15 20 10 11 35 40 30 35
2020-01-01 21 22 20 21 45 48 43 46
2021-01-01 38 40 32 37 67 70 65 66
This does enable the slicing capabilities you want, for example, to get all the opening prices, you can use the cross-section method:
df.xs('open', level='price', axis=1)
stock Daimler Tesla
2019-01-01 15 35
2020-01-01 21 45
2021-01-01 38 67
To create a new column for each stock, you could loop over the stocks:
for stock in df.columns.levels[0]:
df[stock, 'range'] = df[stock, 'high'] - df[stock, 'low']
df = df.sort_index(axis=1)
df
stock Daimler Tesla
price close high low open range close high low open range
2019-01-01 11 20 10 15 10 35 40 30 35 10
2020-01-01 21 22 20 21 2 46 48 43 45 5
2021-01-01 37 40 32 38 8 66 70 65 67 5
An alternative that is perhaps more idiomatic is to stack the data according to stock, i.e. move the MultiIndex from the columns to the rows. Then you only have to assign one new column. For example, to produce the same result as above:
df = df.stack('stock')
df['range'] = df.high - df.low
df = df.unstack().swaplevel(axis=1).sort_index(axis=1)