How do I conditionally resample data into hourly values only when enough data is present?-CodePudding

My first time trying something more tricky and coming on here for advice. I am trying to resample my data that is at times in 10 minute increments in such a way that I obtain hourly values only if half or more of the data is an actual value instead of a NaN value. How do I still return a NaN value if more than 50% is NaN? So that I still have a continuous time series (at least when it comes to timestamps).

I am working in Python 3 via a jupyterlab notebook (not sure if that makes a difference).

Here's a small example of what I am trying to do:

Let's say that for an hour the rain gauge I am using needs to be repaired and therefore doesnt record any data:

Datetime          Precip
2014-07-01 00:00  0.2
2014-07-01 00:10  0.3
2014-07-01 00:20  0.1
2014-07-01 00:30  0.0
2014-07-01 00:40  NaN
2014-07-01 00:50  NaN
2014-07-01 01:00  NaN
2014-07-01 01:10  NaN
2014-07-01 01:20  NaN
2014-07-01 01:30  NaN
2014-07-01 01:40  0.0
2014-07-01 01:50  0.0

I would want the resample to return the following output where the first hour is resampled (since it has more than 50% data) and for the second hour to be returned as a NaN.

Datetime          Precip
2014-07-01 00:00  0.15
2014-07-01 00:10  NaN

So far I have just resampled as such:

df['Precip'].resample(rule = 'h').mean()

However that obviously resamples everything.

CodePudding user response：

I have found a solution to my issue with the following function:

In the function, the perc_of_NaN is the threshhold of NaN values above which you want to exclude any averaging. eg. You don't want resampling to be done if more than 60% of datapoints are NaN, then perc_of_NaN would be 60.

Additionally, Averaging_number is the amount of rows (could be timesteps/whatever your data is) that you want to resample together.

def NaN_Resample(df,perc_of_NaN,Averaging_number):
    rows = int(len(df)/Averaging_number)
    columns = int(len(df.columns))
    Output = pd.DataFrame(np.zeros((rows,columns)))
    Headers = list(df.columns)
    Output = Output.set_axis(Headers, axis=1)
    dec_of_NaN = perc_of_NaN/100
    for x in range (columns):
        for y in range (rows):
            if df.iloc[(y*Averaging_number):((y*Averaging_number) (Averaging_number)),x].isna().sum()>(Averaging_number*dec_of_NaN): 
                Output.iloc[y,x] == 'NaN'
            Output.iloc[y,x] = df.iloc[(y*Averaging_number):((y*Averaging_number) (Averaging_number)),x].mean()
    Output.index = pd.to_datetime(Output[Headers[0]])      
    Output = Output.resample(rule = 'd').mean()
    return Output

Feel free to use the function/ask any questions about it if it is helpful to you :)