Appending a record using pandas-CodePudding

I have 5 minute data uploaded from my solar inverter over 24 hours which should equate to 288 records. There are days when the dataset does not contain 288 records, usually 285 to 287 records. It seems to miss something almost everyday. I have code that uses the 5 minute data to create daily records but they are inaccurate if not all the 5 minute data is there. I have been checking manually on a daily basis and adding the missing records manually by simply copying the previous record if one does not exist. I know it is not prefect but it is better have having no record and under reporting.

The dataset comes in with comma(,) between the values in each record and semicolon(;) between records. i.e. for 3 records

20211002,15:35,27294,3.446,1080,1080,0.136,15380,600,NaN,244.4,0.000,0.000,0.000,1080.000,0.000,100.000;20211002,15:30,27204,3.435,1830,1836,0.232,15330,660,NaN,246.0,0.000,0.000,0.000,1830.000,0.000,100.000;20211002,15:25,27051,3.416,4820,4824,0.609,15275,1580,NaN,249.2,0.000,0.000,0.000,4820.000,0.000,100.000

I then convert this to an array using the split function

with open(r'C:\Users\david\GrowattAPI\5minutecheck.csv') as f:
    array = [l.split(",") for l in f.readline().split(";") if l]

Which gives

[['20211002', '15:35', '27294', '3.446', '1080', '1080', '0.136', '15380', '600', 'NaN', '244.4', '0.000', '0.000', '0.000', '1080.000', '0.000', '100.000'], ['20211002', '15:30', '27204', '3.435', '1830', '1836', '0.232', '15330', '660', 'NaN', '246.0', '0.000', '0.000', '0.000', '1830.000', '0.000', '100.000'], ['20211002', '15:25', '27051', '3.416', '4820', '4824', '0.609', '15275', '1580', 'NaN', '249.2', '0.000', '0.000', '0.000', '4820.000', '0.000', '100.000']]

I then use

df = pd.DataFrame(array, columns=['DATE', 'TIME', 'ENERGY_OUT', 'AAA', 'POWER_OUT', 'BBB', 'CCC', 'ENERGY_IN', 'POWER_IN', 'TEMPERATURE', 'VOLTAGE', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12'])

to calculate the daily figures.

What I am trying to do is check that there are 288 records by comparing the time stamps are 5 minutes apart. I can address using

array[0][1] array[1][1] . . . array[n][1]

but I can't work out to check then append using df. All I want to do is if the 5 minute data is not there then just copy the previous record and add it to the array.

Any help appreciated.

David

CodePudding user response：

I've simulated the case when you have one reading missing at 15:35.

columns = ['DATE', 'TIME', 'ENERGY_OUT', 'AAA', 'POWER_OUT', 'BBB', 'CCC', 'ENERGY_IN',
           'POWER_IN', 'TEMPERATURE', 'VOLTAGE', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12']

array = [
    ['20211002', '15:40', '27294', '3.446', '1080', '1080', '0.136', '15380', '600',
        'NaN', '244.4', '0.000', '0.000', '0.000', '1080.000', '0.000', '100.000'],
    ['20211002', '15:30', '27204', '3.435', '1830', '1836', '0.232', '15330', '660',
        'NaN', '246.0', '0.000', '0.000', '0.000', '1830.000', '0.000', '100.000'],
    ['20211002', '15:25', '27051', '3.416', '4820', '4824', '0.609', '15275', '1580',
        'NaN', '249.2', '0.000', '0.000', '0.000', '4820.000', '0.000', '100.000']
]

By resampling every 5 min with forward fill (previous value is used) you can fill the gaps.

df = pd.DataFrame(array, columns=columns)
df['DATETIME'] = pd.to_datetime(df['DATE'] ' ' df['TIME'])
df = df.sort_values('DATETIME').set_index('DATETIME')
df_ffill = df.resample('5Min').ffill()

Results in:

                         DATE   TIME ENERGY_OUT    AAA POWER_OUT   BBB  ...     v7     v8     v9       v10    v11      v12
DATETIME                                                                ...                                               
2021-10-02 15:25:00  20211002  15:25      27051  3.416      4820  4824  ...  0.000  0.000  0.000  4820.000  0.000  100.000
2021-10-02 15:30:00  20211002  15:30      27204  3.435      1830  1836  ...  0.000  0.000  0.000  1830.000  0.000  100.000
2021-10-02 15:35:00  20211002  15:30      27204  3.435      1830  1836  ...  0.000  0.000  0.000  1830.000  0.000  100.000
2021-10-02 15:40:00  20211002  15:40      27294  3.446      1080  1080  ...  0.000  0.000  0.000  1080.000  0.000  100.000

[4 rows x 17 columns]

DATE and TIME columns values are still from the original data, but DATETIME one is updated with the resampling step.