Looking to drop the first 5 rows of a dataframe after every new value occurs

I am looking to drop the first 5 rows each time a new value occurs in a column of a dataframe:

import pandas as pd

data = {
    'col1': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
}
df = pd.DataFrame(data)

That is, the first 5 rows after each new value in col1 should be removed. Ex: the 'A' value is new, so delete its first 5 rows. Then the 'B' value is encountered, so delete its first 5 rows, and so on.

CodePudding user response:

Number the rows within each group using cumcount (starting at 0), then keep only the rows whose position is 5 or higher:

mask = df.groupby('col1').cumcount() >= 5
df = df.loc[mask]
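
As a self-contained sketch of the cumcount approach on the question's data (the names position and result are just illustrative, and the lists are built with repetition for brevity):

```python
import pandas as pd

# Same data as in the question, written more compactly
df = pd.DataFrame({
    'col1': ['A'] * 7 + ['B'] * 7 + ['C'] * 7,
    'col2': list(range(1, 22)),
})

# cumcount numbers the rows inside each group: 0, 1, 2, ...
position = df.groupby('col1').cumcount()

# Keep rows at position 5 and beyond, i.e. drop the first 5 rows per group
result = df.loc[position >= 5]
print(result)
```

Note that this groups by value, so all 'A' rows count as one group even if they are not adjacent.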

CodePudding user response:

You can use a negative tail, which returns all rows of each group except the first 5:

df.groupby('col1').tail(-5)

To group by consecutive values instead (so that a value that reappears later counts as a new group):

group = df['col1'].ne(df['col1'].shift()).cumsum()
df.groupby(group).tail(-5)

Output:

   col1  col2
5     A     6
6     A     7
12    B    13
13    B    14
19    C    20
20    C    21

NB. As pointed out by @Mark, there is an issue with the negative tail in older pandas versions (<1.4), in which case the cumcount approach can be used instead.
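
For older pandas versions, the consecutive-value grouping can presumably be combined with cumcount as well. A minimal sketch, assuming a made-up dataframe in which 'A' reappears after 'B' (this data and the names run_id and result are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical data: 'A' occurs in two separate runs
df = pd.DataFrame({
    'col1': ['A'] * 7 + ['B'] * 6 + ['A'] * 6,
    'col2': list(range(1, 20)),
})

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those change flags gives each run its own id
run_id = df['col1'].ne(df['col1'].shift()).cumsum()

# Number rows within each run and drop the first 5 of every run
result = df.loc[df.groupby(run_id).cumcount() >= 5]
print(result)
```

Grouping by 'col1' directly would instead treat both 'A' runs as a single group, so every row of the second run would be kept.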