Home > Mobile >  How to convert one record per change to continuous data?
How to convert one record per change to continuous data?

Time:09-25

My data looks like this:
print(df)

DateTime,      Status
'2021-09-01',  0
'2021-09-05',  1
'2021-09-07',  0

And I need it to look like this:
print(df_desired)

DateTime,      Status
'2021-09-01',  0
'2021-09-02',  0
'2021-09-03',  0
'2021-09-04',  0
'2021-09-05',  1
'2021-09-06',  1
'2021-09-07',  0

Right now I accomplish this using pandas like this:

new_index = pd.DataFrame(index = pd.date_range(df.index[0], df.index[-1], freq='D'))
df = new_index.join(df).ffill()

Missing values before the first record in any column are imputed using the inverse of the first record in that column because it's binary and only shows change-points this is guaranteed to be correct.

To my understanding my desired dataframe contained "continuous" data, but I'm not sure what to call the data structure in my source data.

The problem:
When I do this to a dataframe that has a frequency of one record per second and I want to load a year's worth of data my memory overflows (92GB required, ~60GB available). I'm not sure if there is a standard procedure instead of my solution that I don't know the name of and cannot find using google or that I'm using the join method wrong, but this seems horribly inefficient, the resulting dataframe is only a few 100 megabytes after this operation. Any feedback on this would be great!

CodePudding user response:

You can use this pipeline:

(df.set_index('DateTime')
   .reindex(pd.date_range(df['DateTime'].min(), df['DateTime'].max()))
   .rename_axis('DateTime')
   .ffill(downcast='infer')
   .reset_index()
)

output:

    DateTime  Status
0 2021-09-01       0
1 2021-09-02       0
2 2021-09-03       0
3 2021-09-04       0
4 2021-09-05       1
5 2021-09-06       1
6 2021-09-07       0

input:

    DateTime  Status
0 2021-09-01       0
1 2021-09-05       1
2 2021-09-07       0

CodePudding user response:

Use DataFrame.asfreq working with DatetimeIndex:

df = df.set_index('DateTime').asfreq('d', method='ffill').reset_index()
print (df)
    DateTime  Status
0 2021-09-01       0
1 2021-09-02       0
2 2021-09-03       0
3 2021-09-04       0
4 2021-09-05       1
5 2021-09-06       1
6 2021-09-07       0
  • Related