Need to create a new df that contains all the records for the maximum col value of n_c. For e.g- if n_c = 2 then only the first 2 records will be stored and the rest will be deleted and if n_c = 4 then the first 4 records will be stored and the rest will be deleted and the same goes for any other values in col n_c. Please help.
Given:
| id | sys_id | n_c | extract_id
|---------------------------------
| G400 | 02f4 | 1 | 284
| G406 | 993c | 2 | 389
| G406 | 993c | 2 | 382
| G406 | 993c | 2 | 129
| G408 | 996d | 4 | 396
| G408 | 996d | 4 | 378
| G408 | 996d | 4 | 368
| G408 | 996d | 4 | 333
| G408 | 996d | 4 | 222
Expected :
| id | sys_id | n_c | extract_id
|---------------------------------
| G400 | 02f4 | 1 | 284
| G406 | 993c | 2 | 389
| G406 | 993c | 2 | 382
| G408 | 996d | 4 | 396
| G408 | 996d | 4 | 378
| G408 | 996d | 4 | 368
| G408 | 996d | 4 | 333
import pandas as pd
# initialize data of lists.
data = {'id': ['G400', 'G406', 'G406', 'G406','G408','G408', 'G408', 'G408','G408'],
'sys_id': ['02f4', '993c', '993c', '993c', '996d','996d', '996d', '996d','996d'],
'n_c': [1,2,2,2,4,4,4,4,4],
'extract_id':[284,389,382,129,396,378,368,333,222]}
# Create DataFrame
df = pd.DataFrame(data)
CodePudding user response:
You can use boolean indexing with groupby.cumcount
:
df[df.groupby(['id', 'sys_id']).cumcount().lt(df['n_c'])]
Output:
id sys_id n_c extract_id
0 G400 02f4 1 284
1 G406 993c 2 389
2 G406 993c 2 382
4 G408 996d 4 396
5 G408 996d 4 378
6 G408 996d 4 368
7 G408 996d 4 333
Intermediates:
id sys_id n_c extract_id cum ount <n_c
0 G400 02f4 1 284 0 True
1 G406 993c 2 389 0 True
2 G406 993c 2 382 1 True
3 G406 993c 2 129 2 False
4 G408 996d 4 396 0 True
5 G408 996d 4 378 1 True
6 G408 996d 4 368 2 True
7 G408 996d 4 333 3 True
8 G408 996d 4 222 4 False
CodePudding user response:
You could do:
df.groupby(['id', 'sys_id']).apply(lambda x: x.head(x['n_c'].iat[0])).reset_index(drop=True)
id sys_id n_c extract_id
0 G400 02f4 1 284
1 G406 993c 2 389
2 G406 993c 2 382
3 G408 996d 4 396
4 G408 996d 4 378
5 G408 996d 4 368
6 G408 996d 4 333