I'm trying to index date values from the date column in a panda dataframe.
Open High Low Close Volume Dividends Stock Splits
Date
2013-01-07 00:00:00-05:00 71.951356 72.440875 71.877924 72.285858 1859400 0.0 0.0
2013-01-08 00:00:00-05:00 72.090030 72.514278 71.151782 71.535240 2173700 0.0 0.0
2013-01-09 00:00:00-05:00 71.926850 72.848777 71.779993 72.661125 2569200 0.0 0.0
2013-01-10 00:00:00-05:00 72.971141 73.370914 72.448986 73.354599 2364800 0.0 0.0
2013-01-11 00:00:00-05:00 73.313856 73.950232 73.036458 73.118050 3284900 0.0 0.0
... ... ... ... ... ... ... ...
2022-12-30 00:00:00-05:00 426.880005 429.739990 424.820007 428.760010 759900 0.0 0.0
2023-01-03 00:00:00-05:00 428.000000 428.160004 420.200012 424.290009 1166300 0.0 0.0
2023-01-04 00:00:00-05:00 423.929993 426.359985 420.209991 423.480011 1050800 0.0 0.0
2023-01-05 00:00:00-05:00 422.570007 423.489990 412.609985 414.730011 1485000 0.0 0.0
2023-01-06 00:00:00-05:00 416.200012 428.540009 416.010010 426.470001 1042200 0.0 0.0
When I look up the column names, the first column 'Date' doesn't exist, so am unable to reference it when setting an index.
>>> for col in stock_df.columns:
... print(col)
...
Open
High
Low
Close
Volume
Dividends
Stock Splits
When I reference index[0] in my data frame, it shows the first column being 'Open'.
>>> stock_current_price = stock_df["Open"].loc[stock_df.index[0]]
>>> print(stock_current_price)
71.95135572116979
If I reference iloc[:, 0] to my dataframe (or any integer), I get the 'Open' column value with the date included. But I can't separate the date column to be used as an index.
>>> test = stock_df.iloc[:, 0]
>>> print(test)
Date
2013-01-07 00:00:00-05:00 71.951356
2013-01-08 00:00:00-05:00 72.090030
2013-01-09 00:00:00-05:00 71.926850
2013-01-10 00:00:00-05:00 72.971141
2013-01-11 00:00:00-05:00 73.313856
...
2022-12-30 00:00:00-05:00 426.880005
2023-01-03 00:00:00-05:00 428.000000
2023-01-04 00:00:00-05:00 423.929993
2023-01-05 00:00:00-05:00 422.570007
2023-01-06 00:00:00-05:00 416.200012
This includes any index integer referenced, so other column iloc's will also have the date shown.
>>> test = stock_df.iloc[:, 4]
>>> print(test)
Date
2013-01-07 00:00:00-05:00 1859400
2013-01-08 00:00:00-05:00 2173700
2013-01-09 00:00:00-05:00 2569200
2013-01-10 00:00:00-05:00 2364800
2013-01-11 00:00:00-05:00 3284900
...
2022-12-30 00:00:00-05:00 759900
2023-01-03 00:00:00-05:00 1166300
2023-01-04 00:00:00-05:00 1050800
2023-01-05 00:00:00-05:00 1485000
2023-01-06 00:00:00-05:00 1042200
I would like to take a large dataframe and specify a column value (Open) for example based on a specific "%Y-%m-%d" date. I'm worn out from coding tonight and hoping that a second set of eyes can take a look and help point me in the right direction.
CodePudding user response:
A DataFrame is surrounded by an index and columns; that's its primary difference from, say, a NumPy array. The "Date" values are in the index part, not in its own column, so, e.g., df["Date"]
will fail and df.iloc[:, N]
wouldn't find it for any N; because it's df.index
. If you must, you can move it to become a column with df.reset_index()
.
However! Index is a crucial part of a DataFrame (or a Series). Almost all operations care about it and you should too :) Arguably, the "Date" is better kept as an index in your case, because
specify a column value (Open) for example based on a specific "%Y-%m-%d" date
is conveniently doable as it stands with
df.loc["1997-04-27", "Open"]
as opposed to
df.loc[df.Date.eq("1997-04-27"), "Open"]
if it ("Date") was a column instead. And it turns out it only gets worse if you want to perform more datetime-specific operations, e.g., resampling, subsetting etc.
(Noting that in df.loc
, part before "," addresses index, and after "," addresses columns.)