Home > Software engineering >  Iterate through col value and reject the extra records
Iterate through col value and reject the extra records

Time:10-18

Need to create a new df that contains all the records for the maximum col value of n_c. For e.g- if n_c = 2 then only the first 2 records will be stored and the rest will be deleted and if n_c = 4 then the first 4 records will be stored and the rest will be deleted and the same goes for any other values in col n_c. Please help.

Given:

| id   | sys_id | n_c | extract_id
|---------------------------------  
| G400 |  02f4  |  1  |   284
| G406 |  993c  |  2  |   389
| G406 |  993c  |  2  |   382
| G406 |  993c  |  2  |   129
| G408 |  996d  |  4  |   396
| G408 |  996d  |  4  |   378
| G408 |  996d  |  4  |   368
| G408 |  996d  |  4  |   333
| G408 |  996d  |  4  |   222

Expected :

| id   | sys_id | n_c | extract_id
|---------------------------------  
| G400 |  02f4  |  1  |   284
| G406 |  993c  |  2  |   389
| G406 |  993c  |  2  |   382
| G408 |  996d  |  4  |   396
| G408 |  996d  |  4  |   378
| G408 |  996d  |  4  |   368
| G408 |  996d  |  4  |   333
import pandas as pd
  
# initialize data of lists.
data = {'id': ['G400', 'G406', 'G406', 'G406','G408','G408', 'G408', 'G408','G408'],
        'sys_id': ['02f4', '993c', '993c', '993c', '996d','996d', '996d', '996d','996d'],
        'n_c': [1,2,2,2,4,4,4,4,4],
        'extract_id':[284,389,382,129,396,378,368,333,222]}
  
# Create DataFrame
df = pd.DataFrame(data)

CodePudding user response:

You can use boolean indexing with groupby.cumcount:

df[df.groupby(['id', 'sys_id']).cumcount().lt(df['n_c'])]

Output:

     id sys_id  n_c  extract_id
0  G400   02f4    1         284
1  G406   993c    2         389
2  G406   993c    2         382
4  G408   996d    4         396
5  G408   996d    4         378
6  G408   996d    4         368
7  G408   996d    4         333

Intermediates:

     id sys_id  n_c  extract_id  cum ount   <n_c
0  G400   02f4    1         284         0   True
1  G406   993c    2         389         0   True
2  G406   993c    2         382         1   True
3  G406   993c    2         129         2  False
4  G408   996d    4         396         0   True
5  G408   996d    4         378         1   True
6  G408   996d    4         368         2   True
7  G408   996d    4         333         3   True
8  G408   996d    4         222         4  False

CodePudding user response:

You could do:

df.groupby(['id', 'sys_id']).apply(lambda x: x.head(x['n_c'].iat[0])).reset_index(drop=True)

     id sys_id  n_c  extract_id
0  G400   02f4    1         284
1  G406   993c    2         389
2  G406   993c    2         382
3  G408   996d    4         396
4  G408   996d    4         378
5  G408   996d    4         368
6  G408   996d    4         333
  • Related