This is a sample from my pandas df.
plant | device | quarter | avg_value | |
---|---|---|---|---|
17 | 4962 | 939322103 | 1999Q3 | 3624000.0 |
16 | 4962 | 939322103 | 1999Q2 | 4407000.0 |
15 | 4962 | 939322103 | 1999Q1 | 5065000.0 |
14 | 4962 | 460981301 | 2006Q2 | 390000.0 |
13 | 4962 | 460981301 | 2006Q1 | 407000.0 |
12 | 4962 | 460981301 | 2005Q4 | 336000.0 |
11 | 4962 | 45839N101 | 1999Q3 | 6977000.0 |
10 | 4962 | 387241102 | 1999Q3 | 95000.0 |
9 | 4962 | 387241102 | 1999Q1 | 57000.0 |
In the row with indexes 10
and 9
, for the same device 387241102
I've got the same device reporting data for avg_value
for 1999Q1
and 1999Q3
. The business logic states that if there is a missing data for one quarter that sits between two quarters that do have avg_value
present, then treat this case as missing data and impute it by creating a new row for the missing [plant, device, quarter]
combination and copying the avg_value
from the latest out of two quarters for the same [plant, device, quarter]
. In this case, the copying avg_value
from 1999Q3
Meaning, for the case above, I'd need to create a new entry for the 1999Q2
in my pandas df like so.
plant | device | quarter | avg_value | |
---|---|---|---|---|
4962 | 387241102 | 1999Q2 | 95000.0 |
I'm struggling with identifying these cases with missing quarter sitting in between two that do have the data ... any help would be appreciated.
CodePudding user response:
Annotated code
# Create a period index by parsing quarter
idx = pd.PeriodIndex(df['quarter'], freq='Q')
# Set the index of dataframe to period idx and sort
df1 = df.drop('quarter', axis=1).set_index(idx).sort_index()
# Group the dataframe by [plant, device] and resample
# with quaterly frequency then use bfill to fill resampled
# NaN values with the values from latest quater
df1 = df1.groupby(['plant', 'device'], group_keys=False).resample('Q').bfill().reset_index()
Result
quarter plant device avg_value
0 1999Q1 4962 387241102 57000.0
1 1999Q2 4962 387241102 95000.0
2 1999Q3 4962 387241102 95000.0
3 1999Q3 4962 45839N101 6977000.0
4 2005Q4 4962 460981301 336000.0
5 2006Q1 4962 460981301 407000.0
6 2006Q2 4962 460981301 390000.0
7 1999Q1 4962 939322103 5065000.0
8 1999Q2 4962 939322103 4407000.0
9 1999Q3 4962 939322103 3624000.0
CodePudding user response:
Pandas has several functions that are very convenient to manipulate time series. So one way to do this is to index your data by plant-device-quarter, then loop over all plant-device combinations, and use pd.PeriodIndex and df.resample to get your results, like this:
import pandas as pd
from io import StringIO
datastr = StringIO("""plant,device,quarter,avg_value
4962,939322103,1999Q3,3624000.0
4962,939322103,1999Q2,4407000.0
4962,939322103,1999Q1,5065000.0
4962,460981301,2006Q2,390000.0
4962,460981301,2006Q1,407000.0
4962,460981301,2005Q4,336000.0
4962,45839N101,1999Q3,6977000.0
4962,387241102,1999Q3,95000.0
4962,387241102,1999Q1,57000.0
""")
df = pd.read_csv(datastr, index_col=[0,1,2]).sort_index()
def fill_quarter_gaps(sdf):
sdf2 = sdf.reset_index(level=['plant','device'], drop=True)
sdf2.index = pd.PeriodIndex(sdf2.index, freq='Q')
return sdf2.resample('Q').bfill()
df.groupby(level=[0,1]).apply(fill_quarter_gaps)
The output will be indexed by plant-device-quarter, but you can reset the index if you want to move those variables to regular columns.