Why do I need to change the frequency for accessing a dataframe indexed with DatetimeIndex in Pandas-CodePudding

I found this example in this post from Stackoverflow:

import pandas as pd
earliest = pd.Timestamp('2012-01-01 06:00:00')
latest = pd.Timestamp('2014-12-01 23:00:00')
dr = pd.date_range(start=earliest, end=latest,freq="30min")
df_freq = pd.DataFrame(index=dr, columns=['freq'])
df_freq = df_freq.fillna(0)

# use str datetime as key 
df_freq['2012-03-04']

That post was published about five years ago, so the pandas API may have changed. Even when the code still works in 2022, I got this warning:

FutureWarning: Indexing a DataFrame with a datetimelike index using a single string
to slice the rows, like `frame[string]`, is deprecated and will be removed in a future 
version. Use `frame.loc[string]` instead.

However, it seems that such indexing by datetime only works for specific frequencies. If you change the frequency from "30min" to "D" (daily), the same snippet just does not work:

import pandas as pd
earliest = pd.Timestamp('2012-01-01')
latest = pd.Timestamp('2014-12-01')
dr = pd.date_range(start=earliest, end=latest, freq="D")     # if freq is D, it does not work
df_freq = pd.DataFrame(index=dr, columns=['freq'])
df_freq = df_freq.fillna(0)

df_freq['2012-03-04']

In that case, I got a KeyError: '2012-03-04' error message.

The workaround is easy, you must use df_freq.loc['2012-03-04'] for indexing. However, I do not know why do I need to change the frequency for accessing a dataframe. Is this behavior documented elsewhere?

CodePudding user response：

The current pandas docs say:

Warning

Indexing DataFrame rows with a single string with getitem (e.g. frame[dtstring]) is deprecated starting with pandas 1.2.0 (given the ambiguity whether it is indexing the rows or selecting a column) and will be removed in a future version. The equivalent with .loc (e.g. frame.loc[dtstring]) is still supported.

When you look at the docs for version 1.1.5 (i.e. last version before 1.2.0), you'll find that your issue is covered there already:

Warning

However, if the string is treated as an exact match, the selection in DataFrame’s [] will be column-wise and not row-wise, see Indexing Basics. For example dft_minute['2011-12-31 23:59'] will raise KeyError as '2012-12-31 23:59' has the same resolution as the index and there is no column with such name:

To always have unambiguous selection, whether the row is treated as a slice or a single selection, use .loc.

So, returning to your specific situation, in your first example with freq="30min", df_freq['2012-03-04'] is not an exact match. E.g.:

print(df_freq['2012-03-04'].head())

                     freq
2012-03-04 00:00:00     0
2012-03-04 00:30:00     0
2012-03-04 01:00:00     0
2012-03-04 01:30:00     0
2012-03-04 02:00:00     0

However, in your second example with freq="D", df_freq['2012-03-04'] does constitute an exact match. As a result, the program starts looking for a column match, rather than an index match. And since your df_freq does not have any such column, you receive the error. Indeed, the last part of the traceback for df_freq['2012-03-04'] applied to your second example shows you this:

Traceback (most recent call last):

  File "script.py", line 8, in <module>
    df_freq['2012-03-04']

  File "..\pandas\core\frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)

  File "..\pandas\core\indexes\base.py", line 3623, in get_loc
    raise KeyError(key) from err

KeyError: '2012-03-04'

Note indexer = self.columns.get_loc(key). I.e. the failure lies with df_freq.columns.get_loc('2012-03-04').

So, the problem doesn't have anything to do with the selected frequency, really. Something like df_freq['2012-03'] for your second example would still get you a slice as well (but with the same warning, of course).

The problem is also avoidable by using df_freq['2012-03-04':'2012-03-04']. Applied to your second example:

print(df_freq['2012-03-04':'2012-03-04'])

            freq
2012-03-04     0

However, note that this returns a df-slice:

print(type(df_freq['2012-03-04':'2012-03-04']))
<class 'pandas.core.frame.DataFrame'>

Contrast .loc:

print(type(df_freq.loc['2012-03-04']))
<class 'pandas.core.series.Series'>