Identify rows with missing records for a YYYYQ quarter that sits in between two quarters for which records exist


This is a sample from my pandas df.

    plant     device quarter  avg_value
17   4962  939322103  1999Q3  3624000.0
16   4962  939322103  1999Q2  4407000.0
15   4962  939322103  1999Q1  5065000.0
14   4962  460981301  2006Q2   390000.0
13   4962  460981301  2006Q1   407000.0
12   4962  460981301  2005Q4   336000.0
11   4962  45839N101  1999Q3  6977000.0
10   4962  387241102  1999Q3    95000.0
 9   4962  387241102  1999Q1    57000.0

In the rows with indexes 10 and 9, the same device 387241102 reports avg_value for 1999Q1 and 1999Q3. The business logic states that if data is missing for a quarter that sits between two quarters that do have avg_value present, then treat this as missing data and impute it by creating a new row for the missing [plant, device, quarter] combination, copying avg_value from the later of the two surrounding quarters for the same [plant, device]. In this case, that means copying avg_value from 1999Q3.

Meaning, for the case above, I'd need to create a new entry for 1999Q2 in my pandas df, like so:

plant     device quarter  avg_value
 4962  387241102  1999Q2    95000.0

I'm struggling with identifying these cases where a missing quarter sits in between two quarters that do have data. Any help would be appreciated.

CodePudding user response:

Annotated code

# Create a PeriodIndex by parsing the quarter strings (e.g. '1999Q3')
idx = pd.PeriodIndex(df['quarter'], freq='Q')

# Set the dataframe index to the period index and sort chronologically
df1 = df.drop('quarter', axis=1).set_index(idx).sort_index()

# Group by [plant, device] and resample at quarterly frequency, which inserts
# a row for every missing quarter between each group's first and last quarter;
# bfill then fills those rows with the values from the next (later) quarter
df1 = df1.groupby(['plant', 'device'], group_keys=False).resample('Q').bfill().reset_index()

Result

  quarter  plant     device  avg_value
0  1999Q1   4962  387241102    57000.0
1  1999Q2   4962  387241102    95000.0
2  1999Q3   4962  387241102    95000.0
3  1999Q3   4962  45839N101  6977000.0
4  2005Q4   4962  460981301   336000.0
5  2006Q1   4962  460981301   407000.0
6  2006Q2   4962  460981301   390000.0
7  1999Q1   4962  939322103  5065000.0
8  1999Q2   4962  939322103  4407000.0
9  1999Q3   4962  939322103  3624000.0
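
If you also want to see just the rows that were imputed (the new [plant, device, quarter] combinations), one option is an anti-join of the filled frame against the original keys. This is a minimal sketch assuming df still holds the original data with quarter stored as strings; the names merged and imputed are just illustrative:

# Convert the Period quarters back to 'YYYYQn' strings so they match the original column
df1['quarter'] = df1['quarter'].astype(str)

# Anti-join: keep only [plant, device, quarter] combinations absent from the original df
merged = df1.merge(df[['plant', 'device', 'quarter']],
                   on=['plant', 'device', 'quarter'],
                   how='left', indicator=True)
imputed = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')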

CodePudding user response:

Pandas has several functions that are very convenient for manipulating time series. So one way to do this is to index your data by plant-device-quarter, then group by the plant-device combinations, and use pd.PeriodIndex and resample to get your results, like this:

import pandas as pd
from io import StringIO

datastr = StringIO("""plant,device,quarter,avg_value
4962,939322103,1999Q3,3624000.0
4962,939322103,1999Q2,4407000.0
4962,939322103,1999Q1,5065000.0
4962,460981301,2006Q2,390000.0
4962,460981301,2006Q1,407000.0
4962,460981301,2005Q4,336000.0
4962,45839N101,1999Q3,6977000.0
4962,387241102,1999Q3,95000.0
4962,387241102,1999Q1,57000.0
""")

df = pd.read_csv(datastr, index_col=[0,1,2]).sort_index()

def fill_quarter_gaps(sdf):
    # Drop the group levels, keeping only the quarter in the index
    sdf2 = sdf.reset_index(level=['plant','device'], drop=True)
    # Parse the quarter strings into a PeriodIndex so resample understands them
    sdf2.index = pd.PeriodIndex(sdf2.index, freq='Q')
    # Insert any missing quarters and backfill them from the later quarter
    return sdf2.resample('Q').bfill()

df.groupby(level=[0,1]).apply(fill_quarter_gaps)

The output will be indexed by plant-device-quarter, but you can reset the index if you want to move those variables to regular columns.
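
For example, a minimal sketch of that last step (the result variable name and the explicit set_names call are just for illustration, in case the PeriodIndex does not carry the 'quarter' name through):

# Capture the result, name the index levels explicitly, and move them into columns
result = df.groupby(level=[0, 1]).apply(fill_quarter_gaps)
result.index = result.index.set_names(['plant', 'device', 'quarter'])
result = result.reset_index()

# Optionally turn the Period quarters back into 'YYYYQn' strings
result['quarter'] = result['quarter'].astype(str)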
