My first time trying something more tricky and coming on here for advice. I am trying to resample my data that is at times in 10 minute increments in such a way that I obtain hourly values only if half or more of the data is an actual value instead of a NaN value. How do I still return a NaN value if more than 50% is NaN? So that I still have a continuous time series (at least when it comes to timestamps).
I am working in Python 3 via a jupyterlab notebook (not sure if that makes a difference).
Here's a small example of what I am trying to do:
Let's say that for an hour the rain gauge I am using needs to be repaired and therefore doesnt record any data:
Datetime Precip
2014-07-01 00:00 0.2
2014-07-01 00:10 0.3
2014-07-01 00:20 0.1
2014-07-01 00:30 0.0
2014-07-01 00:40 NaN
2014-07-01 00:50 NaN
2014-07-01 01:00 NaN
2014-07-01 01:10 NaN
2014-07-01 01:20 NaN
2014-07-01 01:30 NaN
2014-07-01 01:40 0.0
2014-07-01 01:50 0.0
I would want the resample to return the following output where the first hour is resampled (since it has more than 50% data) and for the second hour to be returned as a NaN.
Datetime Precip
2014-07-01 00:00 0.15
2014-07-01 00:10 NaN
So far I have just resampled as such:
df['Precip'].resample(rule = 'h').mean()
However that obviously resamples everything.
CodePudding user response:
I have found a solution to my issue with the following function:
In the function, the perc_of_NaN is the threshhold of NaN values above which you want to exclude any averaging. eg. You don't want resampling to be done if more than 60% of datapoints are NaN, then perc_of_NaN would be 60.
Additionally, Averaging_number is the amount of rows (could be timesteps/whatever your data is) that you want to resample together.
def NaN_Resample(df,perc_of_NaN,Averaging_number):
rows = int(len(df)/Averaging_number)
columns = int(len(df.columns))
Output = pd.DataFrame(np.zeros((rows,columns)))
Headers = list(df.columns)
Output = Output.set_axis(Headers, axis=1)
dec_of_NaN = perc_of_NaN/100
for x in range (columns):
for y in range (rows):
if df.iloc[(y*Averaging_number):((y*Averaging_number) (Averaging_number)),x].isna().sum()>(Averaging_number*dec_of_NaN):
Output.iloc[y,x] == 'NaN'
Output.iloc[y,x] = df.iloc[(y*Averaging_number):((y*Averaging_number) (Averaging_number)),x].mean()
Output.index = pd.to_datetime(Output[Headers[0]])
Output = Output.resample(rule = 'd').mean()
return Output
Feel free to use the function/ask any questions about it if it is helpful to you :)