Drop rows of dataframe if the rows have continuously the same value-CodePudding

I am dealing with metered time series data, that should not have the exact same value for more than n steps. I want to build a script that, given a threshold n, splits the timeseries, if for more than n steps the same value has been recorded. So for example: threshold = 3

metered_dataframe :

expected result after script:

cleaned_dataframe_1:

   values
0    0.23
1    0.24
2    0.21
3    0.19

cleaned_dataframe_2:

   values
0    0.11
1    0.14
2    0.12

Since more than three rows had the same value, the identical values have been deleted and two new dataframes have been created. If at some point later there would be repeating same values again i would need to split it into a third dataframe etc.

What I though of so far:

functions to identify same value repeats over a given list:

def one_value(L):
    return all(x == y for x, y in zip(L, L[1:]))

timeseries_dummy_1 = [0.23,0.24,0.21,0.19,0,0,0,0,0,0.11,0.14,0.12]
df_1 = pd.DataFrame(timeseries_dummy_1,columns = ["values"])

threshold = 3

df_split =  np.array_split(df_1, np.ceil(len(df_1)/threshold))


faulty_ = []
i=0
for day in df_split:
    if one_value(list(day["values"])) == True:
            faulty_.append(i)

    i =1
for f_ in faulty_:
    df_1.drop(df_1.index[(0 f_*threshold):(threshold f_*threshold)],inplace=True)
    
print(df_1)

    values
0     0.23
1     0.24
2     0.21
3     0.19
4     0.00
5     0.00
9     0.11
10    0.14
11    0.12

This works fine so far, of course some 0s remain but for my context it is fine, as the threshold defines certain timeranges that are to be looked at on their own. Meaning that only if the whole threshold range is the same value it should be deleted. So far so good.

Now I somehow need to get two dataframes out of this one but with a flexible function that would potentially detect all indexes that suddenly "jump" in value, as here at 5 to 9. So the firstdataframe should be index 0 to 5, then the next 9 to 11 and if there are more breaks, again from whatever lies there.

How could I do that?

CodePudding user response：

Try this:

df:

df = pd.DataFrame({'values': [.23,.24,.21,.21,0,0,0,0,0,.11,.14,.12]})

Solution:

threshold = 3
s = df['values'].groupby(df['values'].diff().ne(0).cumsum()).transform('count').lt(threshold)
i = ~s

d = {n 1: df for n,(name,df) in enumerate(df.loc[s].groupby(i.cumsum()))}

The output is a dictionary that you can select which series you would like. d.get(1) will return the first series.