I have a dataframe that counts the number of times an event has occurred per user per day. Users may have 0 events on a given day, and (since the table is aggregated from a raw event log) rows with 0 events are missing from the dataframe. I would like to add these missing rows and group the data by week, so that each user has one entry per week (including 0 where applicable).
Here is an example of my input:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({
    "person_id": np.arange(3).repeat(5),
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="d"),
    "event_count": np.random.randint(1, 7, 15),
})
# end of each week
# Note: week 2022-01-23 is not in df, but should be part of the result
desired_index = pd.to_datetime(["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"])
df
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-01 00:00:00 | 4 |
| 1 | 0 | 2022-01-02 00:00:00 | 5 |
| 2 | 0 | 2022-01-03 00:00:00 | 3 |
| 3 | 0 | 2022-01-04 00:00:00 | 5 |
| 4 | 0 | 2022-01-05 00:00:00 | 5 |
| 5 | 1 | 2022-01-06 00:00:00 | 2 |
| 6 | 1 | 2022-01-07 00:00:00 | 3 |
| 7 | 1 | 2022-01-08 00:00:00 | 3 |
| 8 | 1 | 2022-01-09 00:00:00 | 3 |
| 9 | 1 | 2022-01-10 00:00:00 | 5 |
| 10 | 2 | 2022-01-11 00:00:00 | 4 |
| 11 | 2 | 2022-01-12 00:00:00 | 3 |
| 12 | 2 | 2022-01-13 00:00:00 | 6 |
| 13 | 2 | 2022-01-14 00:00:00 | 5 |
| 14 | 2 | 2022-01-15 00:00:00 | 2 |
This is what my desired result looks like:
| | person_id | level_1 | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 0 | 2022-01-16 00:00:00 | 0 |
| 3 | 0 | 2022-01-23 00:00:00 | 0 |
| 4 | 1 | 2022-01-02 00:00:00 | 0 |
| 5 | 1 | 2022-01-09 00:00:00 | 11 |
| 6 | 1 | 2022-01-16 00:00:00 | 5 |
| 7 | 1 | 2022-01-23 00:00:00 | 0 |
| 8 | 2 | 2022-01-02 00:00:00 | 0 |
| 9 | 2 | 2022-01-09 00:00:00 | 0 |
| 10 | 2 | 2022-01-16 00:00:00 | 20 |
| 11 | 2 | 2022-01-23 00:00:00 | 0 |
I can produce it using:
(
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .groupby("person_id").apply(
        lambda df: (
            df
            .reset_index(drop=True, level=0)
            .reindex(desired_index, fill_value=0))
    )
    .reset_index()
)
However, according to the docs of reindex, I should be able to pass level=1 as a kwarg directly, without the second groupby. When I do this, I get an "inner join" of the two indices instead of an "outer join":
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(desired_index, level=1)
    .reset_index()
)
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 1 | 2022-01-09 00:00:00 | 11 |
| 3 | 1 | 2022-01-16 00:00:00 | 5 |
| 4 | 2 | 2022-01-16 00:00:00 | 20 |
Why is that, and how am I supposed to use df.reindex correctly?
I have found a similar SO question on reindexing a multi-index level, but the accepted answer there uses df.unstack, which doesn't work for me, because not every entry of my desired index occurs in my current index (and vice versa).
CodePudding user response:
You need to reindex by both levels of the MultiIndex:
mux = pd.MultiIndex.from_product([df['person_id'].unique(), desired_index],
                                 names=['person_id', 'date'])

result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(mux, fill_value=0)
    .reset_index()
)
print(result)
person_id date event_count
0 0 2022-01-02 9
1 0 2022-01-09 13
2 0 2022-01-16 0
3 0 2022-01-23 0
4 1 2022-01-02 0
5 1 2022-01-09 11
6 1 2022-01-16 5
7 1 2022-01-23 0
8 2 2022-01-02 0
9 2 2022-01-09 0
10 2 2022-01-16 20
11 2 2022-01-23 0
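As an aside (not part of the answer above, just a sketch): the unstack route mentioned in the question can also produce the same result if the missing weeks are added as zero-filled columns before stacking back. This assumes desired_index is defined as in the question and that every person_id appears at least once in df; otherwise that person would be missing entirely.
weekly = df.groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()

result_alt = (
    weekly["event_count"]
    .unstack("date", fill_value=0)                 # one column per existing week
    .reindex(columns=desired_index, fill_value=0)  # add weeks with no events as 0
    .rename_axis(columns="date")
    .stack()                                       # back to (person_id, date) rows
    .reset_index(name="event_count")
)
Whether this or the explicit MultiIndex is preferable is mostly a matter of taste; the from_product version makes the "every person × every week" intent more explicit.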