I found this example in this post from Stackoverflow:
import pandas as pd
earliest = pd.Timestamp('2012-01-01 06:00:00')
latest = pd.Timestamp('2014-12-01 23:00:00')
dr = pd.date_range(start=earliest, end=latest,freq="30min")
df_freq = pd.DataFrame(index=dr, columns=['freq'])
df_freq = df_freq.fillna(0)
# use str datetime as key
df_freq['2012-03-04']
That post was published about five years ago, so the pandas API may have changed. Even when the code still works in 2022, I got this warning:
FutureWarning: Indexing a DataFrame with a datetimelike index using a single string
to slice the rows, like `frame[string]`, is deprecated and will be removed in a future
version. Use `frame.loc[string]` instead.
However, it seems that such indexing by datetime only works for specific frequencies.
If you change the frequency from "30min"
to "D"
(daily), the same snippet just does not work:
import pandas as pd
earliest = pd.Timestamp('2012-01-01')
latest = pd.Timestamp('2014-12-01')
dr = pd.date_range(start=earliest, end=latest, freq="D") # if freq is D, it does not work
df_freq = pd.DataFrame(index=dr, columns=['freq'])
df_freq = df_freq.fillna(0)
df_freq['2012-03-04']
In that case, I got a KeyError: '2012-03-04'
error message.
The workaround is easy, you must use df_freq.loc['2012-03-04']
for indexing.
However, I do not know why do I need to change the frequency for accessing a dataframe.
Is this behavior documented elsewhere?
CodePudding user response:
The current pandas docs
say:
Warning
Indexing DataFrame rows with a single string with getitem (e.g. frame[dtstring]) is deprecated starting with pandas 1.2.0 (given the ambiguity whether it is indexing the rows or selecting a column) and will be removed in a future version. The equivalent with .loc (e.g. frame.loc[dtstring]) is still supported.
When you look at the docs for version 1.1.5
(i.e. last version before 1.2.0
), you'll find that your issue is covered there already:
Warning
However, if the string is treated as an exact match, the selection in DataFrame’s [] will be column-wise and not row-wise, see Indexing Basics. For example dft_minute['2011-12-31 23:59'] will raise KeyError as '2012-12-31 23:59' has the same resolution as the index and there is no column with such name:
To always have unambiguous selection, whether the row is treated as a slice or a single selection, use .loc.
So, returning to your specific situation, in your first example with freq="30min"
, df_freq['2012-03-04']
is not an exact match. E.g.:
print(df_freq['2012-03-04'].head())
freq
2012-03-04 00:00:00 0
2012-03-04 00:30:00 0
2012-03-04 01:00:00 0
2012-03-04 01:30:00 0
2012-03-04 02:00:00 0
However, in your second example with freq="D"
, df_freq['2012-03-04']
does constitute an exact match. As a result, the program starts looking for a column
match, rather than an index
match. And since your df_freq
does not have any such column, you receive the error. Indeed, the last part of the traceback for df_freq['2012-03-04']
applied to your second example shows you this:
Traceback (most recent call last):
File "script.py", line 8, in <module>
df_freq['2012-03-04']
File "..\pandas\core\frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "..\pandas\core\indexes\base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: '2012-03-04'
Note indexer = self.columns.get_loc(key)
. I.e. the failure lies with df_freq.columns.get_loc('2012-03-04')
.
So, the problem doesn't have anything to do with the selected frequency, really. Something like df_freq['2012-03']
for your second example would still get you a slice as well (but with the same warning, of course).
The problem is also avoidable by using df_freq['2012-03-04':'2012-03-04']
. Applied to your second example:
print(df_freq['2012-03-04':'2012-03-04'])
freq
2012-03-04 0
However, note that this returns a df
-slice:
print(type(df_freq['2012-03-04':'2012-03-04']))
<class 'pandas.core.frame.DataFrame'>
Contrast .loc
:
print(type(df_freq.loc['2012-03-04']))
<class 'pandas.core.series.Series'>