Using pd.Grouper with a datetime key in conjunction with another key creates a set of groups, but this does not seem to encompass all of the groups that need to be created, in my opinion.
>>> test = pd.DataFrame({"id":["a","b"]*3, "b":pd.date_range("2000-01-01","2000-01-03", freq="9H")})
>>> test
id b
0 a 2000-01-01 00:00:00
1 b 2000-01-01 09:00:00
2 a 2000-01-01 18:00:00
3 b 2000-01-02 03:00:00
4 a 2000-01-02 12:00:00
5 b 2000-01-02 21:00:00
When I tried to create groups based on the date and id
values:
>>> g = test.groupby([pd.Grouper(key='b', freq="D"), 'id'])
>>> g.groups
{(2000-01-01 00:00:00, 'a'): [0], (2000-01-02 00:00:00, 'b'): [1]}
g.groups
shows only 2 groups when I expected 4 groups: both "a" and "b" for each day.
However, when I created another column based off of "b":
>>> test['date'] = test.b.dt.date
>>> g = test.groupby(['date', 'id'])
>>> g.groups
{(2000-01-01, 'a'): [0, 2], (2000-01-01, 'b'): [1], (2000-01-02, 'a'): [4], (2000-01-02, 'b'): [3, 5]}
The outcome was exactly what I expected.
I don't know how to make sense of these different outcomes. Please enlighten me.
CodePudding user response:
You do have 4 groups with the Grouper
, the output of g.groups
is misleading (maybe worth reporting as a bug?):
g = test.groupby([pd.Grouper(key='b', freq="D"), 'id'])
g.ngroups
# 4
g.size()
# b id
# 2000-01-01 a 2
# b 1
# 2000-01-02 a 1
# b 2
# dtype: int64
CodePudding user response:
I believe it is because of the difference between 'pd.Grouper' and the 'dt.date' method in pandas. 'pd.Grouper' groups by a range of values (e.g., daily, hourly, etc.) while 'dt.date' returns just the date part of a datetime object, effectively creating a categorical variable.
When you use 'pd.Grouper' with a frequency of "D", it will group by full days, so each day is represented by only one group. But in your case, each id has multiple records for a given day. So, 'pd.Grouper' is not able to capture all of the groups that you expect.
On the other hand, when you use the 'dt.date' method to extract the date part of the datetime, it creates a categorical variable that represents each date independently.
so when you group by this new date column along with the id column, each group will correspond to a unique combination of date and id, giving you the expected outcome.
In summary, pd.Grouper is useful when you want to group by a range of values (e.g., daily, hourly), while using a separate column for the exact values (e.g., a column for dates only) is useful when you want to group by specific values.
Hope it helps.