Identify rows with missing records for a YYYYQ quarter that sits in between two quarters for which r-CodePudding

This is a sample from my pandas df.

	plant	device	quarter	avg_value
17	4962	939322103	1999Q3	3624000.0
16	4962	939322103	1999Q2	4407000.0
15	4962	939322103	1999Q1	5065000.0
14	4962	460981301	2006Q2	390000.0
13	4962	460981301	2006Q1	407000.0
12	4962	460981301	2005Q4	336000.0
11	4962	45839N101	1999Q3	6977000.0
10	4962	387241102	1999Q3	95000.0
9	4962	387241102	1999Q1	57000.0

In the row with indexes 10 and 9, for the same device 387241102 I've got the same device reporting data for avg_value for 1999Q1 and 1999Q3. The business logic states that if there is a missing data for one quarter that sits between two quarters that do have avg_value present, then treat this case as missing data and impute it by creating a new row for the missing [plant, device, quarter] combination and copying the avg_value from the latest out of two quarters for the same [plant, device, quarter]. In this case, the copying avg_value from 1999Q3

Meaning, for the case above, I'd need to create a new entry for the 1999Q2 in my pandas df like so.

	plant	device	quarter	avg_value
	4962	387241102	1999Q2	95000.0

I'm struggling with identifying these cases with missing quarter sitting in between two that do have the data ... any help would be appreciated.

CodePudding user response：

Annotated code

# Create a period index by parsing quarter
idx = pd.PeriodIndex(df['quarter'], freq='Q')

# Set the index of dataframe to period idx and sort
df1 = df.drop('quarter', axis=1).set_index(idx).sort_index()

# Group the dataframe by [plant, device] and resample
# with quaterly frequency then use bfill to fill resampled
# NaN values with the values from latest quater
df1 = df1.groupby(['plant', 'device'], group_keys=False).resample('Q').bfill().reset_index()

Result

  quarter  plant     device  avg_value
0  1999Q1   4962  387241102    57000.0
1  1999Q2   4962  387241102    95000.0
2  1999Q3   4962  387241102    95000.0
3  1999Q3   4962  45839N101  6977000.0
4  2005Q4   4962  460981301   336000.0
5  2006Q1   4962  460981301   407000.0
6  2006Q2   4962  460981301   390000.0
7  1999Q1   4962  939322103  5065000.0
8  1999Q2   4962  939322103  4407000.0
9  1999Q3   4962  939322103  3624000.0

CodePudding user response：

Pandas has several functions that are very convenient to manipulate time series. So one way to do this is to index your data by plant-device-quarter, then loop over all plant-device combinations, and use pd.PeriodIndex and df.resample to get your results, like this:

import pandas as pd
from io import StringIO

datastr = StringIO("""plant,device,quarter,avg_value
4962,939322103,1999Q3,3624000.0
4962,939322103,1999Q2,4407000.0
4962,939322103,1999Q1,5065000.0
4962,460981301,2006Q2,390000.0
4962,460981301,2006Q1,407000.0
4962,460981301,2005Q4,336000.0
4962,45839N101,1999Q3,6977000.0
4962,387241102,1999Q3,95000.0
4962,387241102,1999Q1,57000.0
""")

df = pd.read_csv(datastr, index_col=[0,1,2]).sort_index()

def fill_quarter_gaps(sdf):
    sdf2 = sdf.reset_index(level=['plant','device'], drop=True)
    sdf2.index = pd.PeriodIndex(sdf2.index, freq='Q')
    return sdf2.resample('Q').bfill()

df.groupby(level=[0,1]).apply(fill_quarter_gaps)

The output will be indexed by plant-device-quarter, but you can reset the index if you want to move those variables to regular columns.