How to delete rows based on change in variable in pandas dataframe

Time:12-18

I've got a dataset with an extremely high sampling rate, and would like to remove excess data where a column's value changes by less than a predefined amount from one row to the next. However, some intermediate points need to be kept so that not all of the data is lost.

e.g.

      t     V
0   1.0   1.0
1   2.0   1.2
2   3.0   2.0
3   3.3   3.0
4   3.4   4.0
5   3.7   4.2
6   3.8   4.6
7   4.4   5.4
8   5.1   6.0
9   6.0   7.0
10  7.0  10.0

Now I want to delete all the rows where the change in V from one row to the next is less than dV AND the change in t is below dt, but still keep data points so that there is data at roughly every interval dV or dt.

Let's say dV = 1 and dt = 1; the desired output would be:

      t     V
0   1.0   1.0
1   2.0   1.2
2   3.0   2.0
3   3.3   3.0
4   3.4   4.0


7   4.4   5.4

9   6.0   7.0
10  7.0  10.0

That is, rows 5, 6 and 8 were deleted because their changes were within the thresholds, but row 7 remains because its change is above dt and dV in both directions.

The easy solution is to iterate over the rows of the dataframe, but a faster (and more idiomatic) solution is wanted.

EDIT: The question was edited to reflect the point that intermediary points must be kept in order to not delete too much.

CodePudding user response:

Use DataFrame.diff with boolean indexing:

dV = 1
dt = 1

df = df[~(df['t'].diff().lt(dt) & df['V'].diff().lt(dV))]
print(df)
      t     V
0   1.0   1.0
1   2.0   1.2
2   3.0   2.0
3   3.3   3.0
4   3.4   4.0
7   5.0   6.0
8   5.1   8.0
9   6.0   9.0
10  7.0  10.0

Or:

dV = 1
dt = 1

df1 = df.diff()

df = df[df1['t'].fillna(dt).ge(dt) | df1['V'].fillna(dV).ge(dV)]
print(df)
      t     V
0   1.0   1.0
1   2.0   1.2
2   3.0   2.0
3   3.3   3.0
4   3.4   4.0
7   5.0   6.0
8   5.1   8.0
9   6.0   9.0
10  7.0  10.0
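
The printed outputs in this thread show values for rows 7 to 9 that differ from the table in the question, so it may help to run the diff filter on the table exactly as posted. The following is a minimal, self-contained sketch of that; the names step, drop and adjacent are illustrative and do not come from the answers.

```python
import pandas as pd

# Sample table exactly as posted in the question.
df = pd.DataFrame({
    "t": [1.0, 2.0, 3.0, 3.3, 3.4, 3.7, 3.8, 4.4, 5.1, 6.0, 7.0],
    "V": [1.0, 1.2, 2.0, 3.0, 4.0, 4.2, 4.6, 5.4, 6.0, 7.0, 10.0],
})

dV = 1
dt = 1

# Row-to-row changes; the first row has no predecessor, so its diff is NaN.
step = df.diff()

# A row is dropped only when both changes are below their thresholds.
# Comparisons against NaN are False, so the first row is always kept.
drop = step["t"].lt(dt) & step["V"].lt(dV)
adjacent = df[~drop]

print(adjacent.index.tolist())  # [0, 1, 2, 3, 4, 9, 10]
```

On this table the filter also removes row 7, which the desired output keeps. diff only compares a row with its immediate predecessor, so it has no notion of how far the data has moved since the last row that was kept.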

CodePudding user response:

You might want to use the shift() method:

diff_df = df - df.shift()

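and then filter the rows of df with loc, dropping those where both changes are below the threshold. Each comparison needs its own parentheses, because & binds more tightly than < in Python:

df = df.loc[~((diff_df['V'] < 1.0) & (diff_df['t'] < 1.0))]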

CodePudding user response:

You can use loc for boolean indexing and compare each value with the previous row in the same column using shift():

# Thresholds
dv = 1
dt = 1

# Drop rows where both changes are below their thresholds
print(df.loc[~((df.V.sub(df.V.shift()) < dv) & (df.t.sub(df.t.shift()) < dt))])

      t     V
0   1.0   1.0
1   2.0   1.2
2   3.0   2.0
3   3.3   3.0
4   3.4   4.0
7   5.0   6.0
8   5.1   8.0
9   6.0   9.0
10  7.0  10.0
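
The answers above compare each row only with its immediate predecessor. The edited question also asks that intermediate points be kept, which suggests comparing each row with the last row that was kept instead. Below is a rough sketch of that idea, not a method taken from any of the answers: thin_rows and its parameters are hypothetical names, and it assumes "change" means the absolute difference. Because each decision depends on the previous one, the sketch uses a plain loop over NumPy arrays rather than a vectorized expression.

```python
import numpy as np
import pandas as pd


def thin_rows(df, dt, dV, t_col="t", v_col="V"):
    """Keep a row once t or V has moved by at least dt or dV
    relative to the most recently kept row (hypothetical helper)."""
    if len(df) == 0:
        return df

    # Work on plain arrays so the loop avoids per-row pandas overhead.
    t = df[t_col].to_numpy()
    v = df[v_col].to_numpy()

    keep = np.zeros(len(df), dtype=bool)
    keep[0] = True  # always keep the first row as the reference point
    last = 0        # position of the most recently kept row

    for i in range(1, len(df)):
        if abs(t[i] - t[last]) >= dt or abs(v[i] - v[last]) >= dV:
            keep[i] = True
            last = i

    return df[keep]


# Sample table exactly as posted in the question.
df = pd.DataFrame({
    "t": [1.0, 2.0, 3.0, 3.3, 3.4, 3.7, 3.8, 4.4, 5.1, 6.0, 7.0],
    "V": [1.0, 1.2, 2.0, 3.0, 4.0, 4.2, 4.6, 5.4, 6.0, 7.0, 10.0],
})

result = thin_rows(df, dt=1, dV=1)
print(result.index.tolist())  # [0, 1, 2, 3, 4, 7, 9, 10]
```

With dV = 1 and dt = 1 this keeps rows 0 to 4, 7, 9 and 10, which matches the desired output in the question. It is still a loop, so whether it is fast enough depends on the size of the dataset.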