I have the following dataframe, the date corresponds to quarterly periods and the amount (and other additional columns not shown here for simplification) corresponding to the associated id grouping. Dates are unique per id.
import pandas as pd
from numpy import nan
d = {'id': ['a', 'a', 'a', 'b', 'b'], 'date': ['2020-09-30', '2020-06-30', '2020-03-31',
'2020-09-30', '2020-06-30'], 'amount': [1, 2, nan , 5, nan]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'])
df
id date amount
0 a 2020-09-30 1
1 a 2020-06-30 2
2 a 2020-03-31
3 b 2020-09-30 5
4 b 2020-06-30
I would like to extend the period of time that the quarters start by a set start quarter date in the past so that the quarter periods start and exist for the same dates for each id. In this example the data should start at 2019-12-31
and any quarters in between should be filled in if the row is missing.
I would like to backfill the missing values (such as amount) based on the most recent present value.
The output would look like this:
id date amount
a 2020-09-30 1
a 2020-06-30 2
a 2020-03-31 2
a 2019-12-31 2
b 2020-09-30 5
b 2020-06-30 5
b 2020-03-31 5
b 2019-12-31 5
What is the best way to go about this?
CodePudding user response:
You need to define a new time range every third month starting from 2019-12-31 and reindex your dataframe. Then fill the NaN
values with a backward fill bfill
method. See code below with comments.
import pandas as pd
# Create the DataFrame according to your question
d = {'id': ['a', 'a', 'a', 'b', 'b'], 'date': ['2020-09-30', '2020-06-30', '2020-03-31',
'2020-09-30', '2020-06-30'], 'amount': [1, 2, None, 5, None]}
df = pd.DataFrame(data=d)
# Transform date to datetime column
df['date'] = pd.to_datetime(df['date'])
# Set multiindex to (id, date) as they are the "unique keys" of your amount values
df.set_index(['id', 'date'], inplace=True)
# Define new period for the datetime index (every 3 months)
index = pd.date_range('2019-12-31', '2020-09-30', freq='3M')
# Reindex the Dataframe and fill NaNs with a backward method
print(df.reindex(pd.MultiIndex.from_product([df.index.get_level_values(0).unique(), index])).fillna(method='bfill'))
# Output
# amount
# id
# a 2019-12-31 2.0
# 2020-03-31 2.0
# 2020-06-30 2.0
# 2020-09-30 1.0
# b 2019-12-31 5.0
# 2020-03-31 5.0
# 2020-06-30 5.0
# 2020-09-30 5.0
Note: Setting the multiindex including the id
column is needed as you want to resample your date without merging id
values a
and b
.
CodePudding user response:
The complete function from pyjanitor offers an abstraction for exposing missing rows:
# pip install pyjanitor
import pandas as pd
import numpy as np
# create a mapping of the new dates
# reusing @scandav's index variable
index = pd.date_range('2019-12-31', '2020-09-30', freq='3M')
index = dict(date = index)
(df.complete('id', index)
.sort_values(['id', 'date'], ascending = [True, False])
.ffill(downcast = 'infer')
)
id date amount
0 a 2020-09-30 1
1 a 2020-06-30 2
2 a 2020-03-31 2
5 a 2019-12-31 2
3 b 2020-09-30 5
4 b 2020-06-30 5
7 b 2020-03-31 5
6 b 2019-12-31 5