Home > OS >  How to detect missing values in a time series for long periods
How to detect missing values in a time series for long periods

Time:02-21

I have electrical consumption data between 2016-2019. The data was recorded every 30 minutes for 4 years. There is no data between 13/03/2019 - 31/03/209.

enter image description here

  • I want to ask how I can detect this miss by coding without visualization because I have 12 countries and they may have such missing values in other months and they are not visible. (Detect if there is a miss for more than 3 consecutive days). Thank you for your help!

This is the data:

        Country Code   Electric Consumption (MW)
Date (index)        
2016-01-01    84              354642.0
2016-01-02    84              376207.0
2016-01-03    84              381534.0
2016-01-04    84              435561.0
2016-01-05    84              447820.0

... ... ...
2019-12-27    12              374340.0
2019-12-28    12              372761.0
2019-12-29    12              379411.0
2019-12-30    12              416044.0
2019-12-31    12              87519.0

CodePudding user response:

Here is a way to identify the 3 day gaps and fill them. Note that this is working on each unique country code. You can save the final_dfs with a list and use pd.concat() if you need to get them back together again.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Country Code': {'2016-01-01': 84,
  '2016-01-02': 84,
  '2016-01-03': 84,
  '2016-01-04': 84,
  '2016-01-05': 84,
  '2019-12-27': 12,
  '2019-12-28': 12,
  '2019-12-29': 12,
  '2019-12-30': 12,
  '2019-12-31': 12},
 'Electric Consumption (MW)': {'2016-01-01': 354642.0,
  '2016-01-02': 376207.0,
  '2016-01-03': 381534.0,
  '2016-01-04': 435561.0,
  '2016-01-05': 447820.0,
  '2019-12-27': 374340.0,
  '2019-12-28': 372761.0,
  '2019-12-29': 379411.0,
  '2019-12-30': 416044.0,
  '2019-12-31': 87519.0}})


# change index value for fake gap
df.index = ['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2019-12-27', '2019-12-28',
               '2019-12-29', '2019-12-30', '2020-01-06']

#convert object dates to datetime
df.index = pd.to_datetime(df.index)


for g in df['Country Code'].unique():
    
    # look at each unique country 
    country_slice = df.loc[df['Country Code'] == g]
    # use timedelta to identify 3 day gaps
    country_slice['Three Day Gap'] = country_slice.index.to_series().diff() > pd.Timedelta('3d')
    # create a new index with previous min and max
    idx = pd.date_range(country_slice.index.min(), country_slice.index.max())

    s = country_slice['Electric Consumption (MW)']

    s.index = pd.DatetimeIndex(s.index)
    # this gives us a series with new rows and nans for the missing dates
    s = s.reindex(idx, fill_value=np.nan)
    # join the old data back to nex index
    country_slice_join = country_slice.join(s, how='outer', lsuffix='L')
    # now we can interpolate as missing dates are new rows
    country_slice_join['interpolate'] = country_slice_join['Electric Consumption (MW)'].interpolate(method='linear', axis=0)
    
    country_slice_join['Country Code'] = country_slice_join['Country Code'].ffill()
    # remove temp columns
    final_df = country_slice_join[['Country Code', 'interpolate']]
    
    final_df.columns = ['Country Code', 'Electric Consumption (MW)']

Example country_slice_join output before final_df:

           Country Code Electric Consumption (MW)L  Three Day Gap   Electric Consumption (MW)   interpolate
2019-12-27  12.0    374340.0    False   374340.0    374340.000000
2019-12-28  12.0    372761.0    False   372761.0    372761.000000
2019-12-29  12.0    379411.0    False   379411.0    379411.000000
2019-12-30  12.0    416044.0    False   416044.0    416044.000000
2019-12-31  12.0    NaN         NaN     NaN         369111.857143
2020-01-01  12.0    NaN         NaN     NaN         322179.714286
2020-01-02  12.0    NaN         NaN     NaN         275247.571429
2020-01-03  12.0    NaN         NaN     NaN         228315.428571
2020-01-04  12.0    NaN         NaN     NaN         181383.285714
2020-01-05  12.0    NaN         NaN     NaN         134451.142857
2020-01-06  12.0    87519.0     True    87519.0     87519.000000

Example final_df output without temporary columns:

         Country Code   Electric Consumption (MW)
2019-12-27  12.0    374340.000000
2019-12-28  12.0    372761.000000
2019-12-29  12.0    379411.000000
2019-12-30  12.0    416044.000000
2019-12-31  12.0    369111.857143
2020-01-01  12.0    322179.714286
2020-01-02  12.0    275247.571429
2020-01-03  12.0    228315.428571
2020-01-04  12.0    181383.285714
2020-01-05  12.0    134451.142857
2020-01-06  12.0    87519.000000

CodePudding user response:

You are the best judge in this situation, and you'll have to decide based on what are the best results for what you're trying to do. By the way, you are trying to fill missing values and not to identify them.

There are a some options inside the DataFrame.interpolate() functions and you can find that here.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html?highlight=interpolate#pandas.DataFrame.interpolate

There are other options that you don't actually do anything with the data, you just copy values from adjacent rows, like DataFrame.ffill() and DataFrame.bfill().

  • Related