I have 5 minute data uploaded from my solar inverter over 24 hours which should equate to 288 records. There are days when the dataset does not contain 288 records, usually 285 to 287 records. It seems to miss something almost everyday. I have code that uses the 5 minute data to create daily records but they are inaccurate if not all the 5 minute data is there. I have been checking manually on a daily basis and adding the missing records manually by simply copying the previous record if one does not exist. I know it is not prefect but it is better have having no record and under reporting.
The dataset comes in with comma(,) between the values in each record and semicolon(;) between records. i.e. for 3 records
20211002,15:35,27294,3.446,1080,1080,0.136,15380,600,NaN,244.4,0.000,0.000,0.000,1080.000,0.000,100.000;20211002,15:30,27204,3.435,1830,1836,0.232,15330,660,NaN,246.0,0.000,0.000,0.000,1830.000,0.000,100.000;20211002,15:25,27051,3.416,4820,4824,0.609,15275,1580,NaN,249.2,0.000,0.000,0.000,4820.000,0.000,100.000
I then convert this to an array using the split function
with open(r'C:\Users\david\GrowattAPI\5minutecheck.csv') as f:
array = [l.split(",") for l in f.readline().split(";") if l]
Which gives
[['20211002', '15:35', '27294', '3.446', '1080', '1080', '0.136', '15380', '600', 'NaN', '244.4', '0.000', '0.000', '0.000', '1080.000', '0.000', '100.000'], ['20211002', '15:30', '27204', '3.435', '1830', '1836', '0.232', '15330', '660', 'NaN', '246.0', '0.000', '0.000', '0.000', '1830.000', '0.000', '100.000'], ['20211002', '15:25', '27051', '3.416', '4820', '4824', '0.609', '15275', '1580', 'NaN', '249.2', '0.000', '0.000', '0.000', '4820.000', '0.000', '100.000']]
I then use
df = pd.DataFrame(array, columns=['DATE', 'TIME', 'ENERGY_OUT', 'AAA', 'POWER_OUT', 'BBB', 'CCC', 'ENERGY_IN', 'POWER_IN', 'TEMPERATURE', 'VOLTAGE', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12'])
to calculate the daily figures.
What I am trying to do is check that there are 288 records by comparing the time stamps are 5 minutes apart. I can address using
array[0][1] array[1][1] . . . array[n][1]
but I can't work out to check then append using df. All I want to do is if the 5 minute data is not there then just copy the previous record and add it to the array.
Any help appreciated.
David
CodePudding user response:
I've simulated the case when you have one reading missing at 15:35.
columns = ['DATE', 'TIME', 'ENERGY_OUT', 'AAA', 'POWER_OUT', 'BBB', 'CCC', 'ENERGY_IN',
'POWER_IN', 'TEMPERATURE', 'VOLTAGE', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12']
array = [
['20211002', '15:40', '27294', '3.446', '1080', '1080', '0.136', '15380', '600',
'NaN', '244.4', '0.000', '0.000', '0.000', '1080.000', '0.000', '100.000'],
['20211002', '15:30', '27204', '3.435', '1830', '1836', '0.232', '15330', '660',
'NaN', '246.0', '0.000', '0.000', '0.000', '1830.000', '0.000', '100.000'],
['20211002', '15:25', '27051', '3.416', '4820', '4824', '0.609', '15275', '1580',
'NaN', '249.2', '0.000', '0.000', '0.000', '4820.000', '0.000', '100.000']
]
By resampling every 5 min with forward fill (previous value is used) you can fill the gaps.
df = pd.DataFrame(array, columns=columns)
df['DATETIME'] = pd.to_datetime(df['DATE'] ' ' df['TIME'])
df = df.sort_values('DATETIME').set_index('DATETIME')
df_ffill = df.resample('5Min').ffill()
Results in:
DATE TIME ENERGY_OUT AAA POWER_OUT BBB ... v7 v8 v9 v10 v11 v12
DATETIME ...
2021-10-02 15:25:00 20211002 15:25 27051 3.416 4820 4824 ... 0.000 0.000 0.000 4820.000 0.000 100.000
2021-10-02 15:30:00 20211002 15:30 27204 3.435 1830 1836 ... 0.000 0.000 0.000 1830.000 0.000 100.000
2021-10-02 15:35:00 20211002 15:30 27204 3.435 1830 1836 ... 0.000 0.000 0.000 1830.000 0.000 100.000
2021-10-02 15:40:00 20211002 15:40 27294 3.446 1080 1080 ... 0.000 0.000 0.000 1080.000 0.000 100.000
[4 rows x 17 columns]
DATE and TIME columns values are still from the original data, but DATETIME one is updated with the resampling step.