I have some data.
I want to remain with rows when an ID has 4 consecutive numbers. For example, if ID 1 has rows 100, 101, 102, 103, 105, the "105" should be excluded.
Data:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
4 1 105
5 2 100
6 2 102
7 2 103
8 2 104
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
18 3 112
19 4 100
20 4 102
21 4 103
22 4 104
23 4 105
24 4 107
Expected results:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
4 3 100
5 3 101
6 3 102
7 3 103
8 3 106
9 3 107
10 3 108
11 3 109
12 3 110
13 4 102
14 4 103
15 4 104
16 4 105
CodePudding user response:
You can identify the consecutive values, then filter the groups by size with groupby.filter
:
# group consecutive X
g = df['X'].diff().gt(1).cumsum() # no need to group here, we'll group later
# filter groups
out = df.groupby(['ID', g]).filter(lambda g: len(g)>=4)#.reset_index(drop=True)
output:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105
CodePudding user response:
Another method:
out = df.groupby(df.groupby('ID')['X'].diff().ne(1).cumsum()).filter(lambda x: len(x) >= 4)
print(out)
# Output
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105