Hi there! I have the following situation, and any help would be much appreciated.
Let's say I have the following dataframe, containing 2 columns and 90 thousand rows (made this shorter so it can be easily reproduced):
   PRODUCT ID        PROBLEM
0           1       OIL LEAK
1           2      FLAT TIRE
2           3       OIL LEAK
3           4  ENGINE ISSUES
4           5  ENGINE ISSUES
5           6       OIL LEAK
6           7       OIL LEAK
7           8      FLAT TIRE
8           9      FLAT TIRE
9       90000       OIL LEAK
I need to drop SOME rows (but not all) based on the values in the 'PROBLEM' column. Imagine the value 'OIL LEAK' appears in my dataframe 11 thousand times, but I want to keep only 50 rows with this value and delete all the other rows where it appears. It doesn't matter to me which rows are dropped, as long as 50 records with this value remain in my dataframe.
Is there a way to do this? Thanks in advance!
CodePudding user response:
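You could set aside 50 of the 'OIL LEAK' rows, remove every 'OIL LEAK' row from the dataframe, and then concatenate the saved rows back (assuming pandas is imported as pd):
# keep the first 50 'OIL LEAK' rows
leaks = df[df['PROBLEM'] == 'OIL LEAK'].head(50)
# drop all 'OIL LEAK' rows, then append the 50 saved ones
df = pd.concat([df[df['PROBLEM'] != 'OIL LEAK'], leaks])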
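For anyone who wants to try this out, here is a minimal sketch of the same idea on the sample data from the question. It uses the top-level pd.concat function, and a limit of 2 stands in for 50 so the effect is visible on ten rows. The names is_leak, limit and result are just illustrative, and the sort_index() call is an optional extra that restores the original row order.

```python
import pandas as pd

# Toy version of the dataframe from the question
df = pd.DataFrame({
    "PRODUCT ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 90000],
    "PROBLEM": ["OIL LEAK", "FLAT TIRE", "OIL LEAK", "ENGINE ISSUES",
                "ENGINE ISSUES", "OIL LEAK", "OIL LEAK", "FLAT TIRE",
                "FLAT TIRE", "OIL LEAK"],
})

limit = 2  # stands in for 50 on the real data

# Boolean mask marking the rows to be capped
is_leak = df["PROBLEM"] == "OIL LEAK"

# Keep the first `limit` leak rows and all non-leak rows
leaks = df[is_leak].head(limit)
result = pd.concat([df[~is_leak], leaks]).sort_index()

print(result)
```

Without sort_index() the kept 'OIL LEAK' rows end up at the bottom of the frame, which may or may not matter for your use case.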
CodePudding user response:
In general we can use grouping with cumulative count like this:
df[df.groupby('PROBLEM').cumcount() < 50]
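To see why this works: cumcount() numbers the rows inside each group starting from 0, so comparing that number with the limit keeps only the first rows of every group. A small sketch on made-up data, with a limit of 2 instead of 50 (the names position and kept are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "PROBLEM": ["OIL LEAK", "FLAT TIRE", "OIL LEAK", "OIL LEAK", "FLAT TIRE"],
})

# Running position of each row within its own PROBLEM group (0-based)
position = df.groupby("PROBLEM").cumcount()

# Keep rows whose position is below the limit, i.e. the first 2 per group
kept = df[position < 2]

print(position.tolist())
print(kept)
```

Note that this caps every value in the column, not just 'OIL LEAK'.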
To apply this logic only to some values in the PROBLEM column:
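# 0-based position of each row within its PROBLEM group
counted = df.groupby('PROBLEM').cumcount()
max_count = 50
problems_to_cut = ['OIL LEAK']
# drop a row only if it is past the limit AND its value is in the list
selected = df[~((counted >= max_count) & (df.PROBLEM.isin(problems_to_cut)))]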
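If different values need different limits, the same idea could be wrapped in a small helper. This is only a sketch: cap_rows and its limits dictionary are hypothetical names, not part of pandas, and the limits used here are small so the result can be checked by eye.

```python
import pandas as pd

def cap_rows(df, column, limits):
    """Keep at most limits[value] rows per listed value; other values are untouched."""
    # 0-based position of each row within its group
    position = df.groupby(column).cumcount()
    # Values missing from `limits` map to NaN; len(df) acts as "no cap" for them
    cap = df[column].map(limits).fillna(len(df))
    return df[position < cap]

df = pd.DataFrame({
    "PRODUCT ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 90000],
    "PROBLEM": ["OIL LEAK", "FLAT TIRE", "OIL LEAK", "ENGINE ISSUES",
                "ENGINE ISSUES", "OIL LEAK", "OIL LEAK", "FLAT TIRE",
                "FLAT TIRE", "OIL LEAK"],
})

# On the real data this would be something like {"OIL LEAK": 50}
capped = cap_rows(df, "PROBLEM", {"OIL LEAK": 2, "FLAT TIRE": 1})
print(capped)
```

Like the snippets above, this keeps the first rows it meets for each value, which fits the question since it does not matter which rows are dropped.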