I am looking to drop the first 5 rows each time a new value occurs in a dataframe
data = {
'col1': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
}
df = pd.DataFrame(data)
I am looking to drop the first 5 rows after each new value. Ex: 'A' value is new... delete first 5 rows. Now encounter 'B' value... delete its first 5 rows...
CodePudding user response:
You need to do the following:
mask = df.groupby('col1').cumcount() >= 5
df = df.loc[mask]
CodePudding user response:
You can use a negative tail
:
df.groupby('col1').tail(-5)
To group by consecutive values:
group = df['col1'].ne(df['col1'].shift()).cumsum()
df.groupby(group).tail(-5)
Output:
col1 col2
5 A 6
6 A 7
12 B 13
13 B 14
19 C 20
20 C 21
NB. As pointed out by @Mark, there is an issue for older pandas versions (<1.4), in which case the cumcount
approach can be used.