This has been asked before, and a working solution was proposed in "Pandas reindex dates in Groupby". It worked for me in the past, but it no longer does.
To recap: I need to reindex a DataFrame by date to create a 'balanced panel', i.e. no Group should be missing a Date-Value combination. Here is an example:
import pandas as pd
from datetime import datetime
date1 = datetime.strptime('2023-01-01', '%Y-%m-%d').date()
date2 = datetime.strptime('2023-01-02', '%Y-%m-%d').date()
date3 = datetime.strptime('2023-01-03', '%Y-%m-%d').date()
df = pd.DataFrame({'Date':[date1] * 3 + [date2] + [date3] * 3,
'Group':['A', 'B', 'C', 'A', 'A', 'B', 'C'],
'Value':[20, 10, 23, 45, 60, 14, 25]})
df.set_index('Date', inplace=True)
Desired output is:
df_target = pd.DataFrame({'Date':[date1] * 3 + [date2] * 3 + [date3] * 3,
'Group':['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Value':[20, 10, 23, 45, 0, 0, 60, 14, 25]})
df_target.set_index('Date', inplace=True)
Attempted solution (note the assertion):
def reindex_by_date(df, freq):
    dates = pd.date_range(start=df.index.min(), end=df.index.max(), freq=freq)
    idx = pd.Index(dates, name='Dates')
    assert dates.duplicated().sum()==0
    return df.reindex(dates, fill_value=0)
df.groupby('Group').apply(reindex_by_date(df, freq='D'))
# this has also been added: .reset_index(drop=True)
Produces an error:
ValueError: cannot reindex from a duplicate axis
I even checked the flags (here it is True):
df.flags.allows_duplicate_labels
CodePudding user response:
What about:
idx = pd.MultiIndex.from_product(
[df.index.unique(), df["Group"].unique()],
names=["Date", "Group"]
)
out = (
df
.set_index("Group", append=True)
.reindex(idx, fill_value=0)
.reset_index(level=1)
)
out:
           Group  Value
Date
2023-01-01     A     20
2023-01-01     B     10
2023-01-01     C     23
2023-01-02     A     45
2023-01-02     B      0
2023-01-02     C      0
2023-01-03     A     60
2023-01-03     B     14
2023-01-03     C     25
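One limitation of building the index from df.index.unique() is that a date missing from every group is never added. A minimal sketch of a variant that builds the date level with pd.date_range instead, assuming the index holds pandas Timestamps rather than datetime.date objects (a change from the question's setup) and that a daily frequency is wanted. The names all_dates and balanced are illustrative only.

```python
import pandas as pd

# Same sample data as in the question, but with Timestamps in the index
# (an assumption; the question stores datetime.date objects).
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2023-01-01"] * 3 + ["2023-01-02"] + ["2023-01-03"] * 3
        ),
        "Group": ["A", "B", "C", "A", "A", "B", "C"],
        "Value": [20, 10, 23, 45, 60, 14, 25],
    }
).set_index("Date")

# Every calendar day between the first and last observed date.
all_dates = pd.date_range(df.index.min(), df.index.max(), freq="D")

# One (Date, Group) pair for every date and every group.
idx = pd.MultiIndex.from_product(
    [all_dates, df["Group"].unique()], names=["Date", "Group"]
)

balanced = (
    df.set_index("Group", append=True)  # index becomes (Date, Group)
    .reindex(idx, fill_value=0)         # add the missing combinations
    .reset_index(level="Group")         # move Group back to a column
)
print(balanced)
```

Because only Value remains a column while reindexing, fill_value=0 cannot overwrite the Group labels.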
CodePudding user response:
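You are calling the function incorrectly in apply: reindex_by_date(df, freq='D') runs immediately on the whole DataFrame instead of being passed to apply as a function to call on each group. Because the full DataFrame's Date index contains duplicates, reindex raises the ValueError.
This should be: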
df.groupby('Group').apply(lambda g: reindex_by_date(g, freq='D'))
Or:
df.groupby('Group').apply(reindex_by_date, freq='D')
Output:
                 Group  Value
Group
A     2023-01-01     A     20
      2023-01-02     A     45
      2023-01-03     A     60
B     2023-01-01     B     10
      2023-01-02     0      0
      2023-01-03     B     14
C     2023-01-01     C     23
      2023-01-02     0      0
      2023-01-03     C     25
Note that fill_value=0 also fills the Group column with 0 in the added rows. To avoid this, drop the Group column before reindexing and restore it from the index afterwards with reset_index.
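The note above can be sketched as follows. This is only one way to do it: it swaps groupby.apply for an explicit loop over the groups plus pd.concat, since the way apply treats the grouping column has changed between pandas versions. It also assumes Timestamps in the index rather than datetime.date objects, and the names pieces and out are illustrative.

```python
import pandas as pd

# Same sample data as in the question, but with Timestamps in the index
# (an assumption; the question stores datetime.date objects).
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2023-01-01"] * 3 + ["2023-01-02"] + ["2023-01-03"] * 3
        ),
        "Group": ["A", "B", "C", "A", "A", "B", "C"],
        "Value": [20, 10, 23, 45, 60, 14, 25],
    }
).set_index("Date")


def reindex_by_date(group, freq):
    # Full date range spanned by this group's own first and last date.
    dates = pd.date_range(
        group.index.min(), group.index.max(), freq=freq, name="Date"
    )
    return group.reindex(dates, fill_value=0)


# Reindex only the Value column of each group, so fill_value=0
# never lands in the Group column.
pieces = {
    name: reindex_by_date(group[["Value"]], freq="D")
    for name, group in df.groupby("Group")
}

# The dict keys become the outer index level; reset_index turns
# that level back into a regular Group column.
out = pd.concat(pieces, names=["Group", "Date"]).reset_index(level="Group")
print(out)
```

Each group is reindexed over its own first-to-last date, as in the original function. If every group should cover the same range, the range can be computed once from the full DataFrame and passed in.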