I have data with two index columns (Id, timestemp) and one data column (velocity). The data could look like this:
Id timestemp velocity
0 0 0
0 1 0
0 2 0
1 1 1
1 2 1
1 3 0
1 4 1
1 5 1
2 4 1
2 5 1
2 6 1
My goal is to remove all rows where velocity is zero. Additionally, the timesteps within each id should be consecutive: if there is a jump in the timesteps, the rows after the jump should be split off under a new id.
So after removing the rows with velocity = 0, the data look like this:
Id timestemp velocity
1 1 1
1 2 1
1 4 1
1 5 1
2 4 1
2 5 1
2 6 1
Now notice the jump from row (1, 2) to row (1, 4). Therefore rows (1, 4) and (1, 5) should get a new unique index (for instance: max index + 1). The data then looks like this:
Id timestemp velocity
1 1 1
1 2 1
2 4 1
2 5 1
2 6 1
3 4 1
3 5 1
I've got some code for removing the rows:
df = df.drop(df[df['velocity'] == 0].index)
But I don't know how to assign rows a new unique index if there is a jump in the second index.
CodePudding user response:
You can use groupby with custom groups:
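# which rows have velocity == 0?
m = df['velocity'].eq(0)
# start a new group whenever timestemp does not increase by exactly 1 within an Id
g1 = df.groupby('Id')['timestemp'].diff().ne(1).cumsum()
# start a new group after every velocity == 0 row
g2 = m.cumsum()
# drop the velocity == 0 rows and number the remaining (g1, g2) groups
out = df[~m].assign(new_id=lambda d: d.groupby([g1, g2]).ngroup())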
output:
    Id  timestemp  velocity  new_id
3    1          1         1       0
4    1          2         1       0
6    1          4         1       1
7    1          5         1       1
8    2          4         1       2
9    2          5         1       2
10   2          6         1       2
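For anyone who wants to reproduce this, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same grouping idea. The helper name split_consecutive is made up for illustration, and the groupers are filtered with the same mask as the rows so that everything lines up explicitly.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Id":        [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2],
    "timestemp": [0, 1, 2, 1, 2, 3, 4, 5, 4, 5, 6],
    "velocity":  [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1],
})


def split_consecutive(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop velocity == 0 rows and label each consecutive run."""
    # True for the rows that will be removed.
    m = df["velocity"].eq(0)
    # Counter that increases whenever timestemp does not advance by exactly 1
    # within an Id (the first row of each Id has a NaN diff, so it also counts).
    g1 = df.groupby("Id")["timestemp"].diff().ne(1).cumsum()
    # Counter that increases at every removed row, so kept rows on either
    # side of a removed row end up with different values.
    g2 = m.cumsum()
    kept = df[~m]
    # ngroup() numbers each distinct (g1, g2) pair 0, 1, 2, ... in sorted key order.
    return kept.assign(new_id=kept.groupby([g1[~m], g2[~m]]).ngroup())


out = split_consecutive(df)
print(out)
```

If the labels should start at 1 instead of 0, adding 1 to new_id is enough. The labels differ from the ones in the question (the split-off rows get 1 rather than 3), but each consecutive run still ends up with its own unique value.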