Home > Software engineering >  Combining np.where with conditional duplicate to assign column value
Combining np.where with conditional duplicate to assign column value

Time:03-01

I have put together a conditional statement using the duplicate function to extract a row from the following data. That is extracted from a larger dataset

{'NID': {104565: '213003580',
  104566: '213003580',
  104567: '213003580',
  104568: '213003580',
  104569: '213003580',
  104570: '213003580',
  104571: '213003580',
  104572: '213003580',
  104573: '213003580',
  104574: '213003580'},
 'Fdat': {104565: Timestamp('2015-02-24 00:00:00'),
  104566: Timestamp('2015-02-24 00:00:00'),
  104567: Timestamp('2016-03-20 00:00:00'),
  104568: Timestamp('2016-03-20 00:00:00'),
  104569: Timestamp('2017-03-13 00:00:00'),
  104570: Timestamp('2018-03-05 00:00:00'),
  104571: Timestamp('2018-03-05 00:00:00'),
  104572: Timestamp('2018-03-05 00:00:00'),
  104573: Timestamp('2018-03-05 00:00:00'),
  104574: Timestamp('2018-03-05 00:00:00')},
 'BCode': {104565: 0,
  104566: 1,
  104567: 0,
  104568: 1,
  104569: 0,
  104570: 0,
  104571: 0,
  104572: 0,
  104573: 1,
  104574: 1},
 'Breddat': {104565: Timestamp('2015-06-09 00:00:00'),
  104566: Timestamp('2015-08-19 00:00:00'),
  104567: Timestamp('2016-06-05 00:00:00'),
  104568: Timestamp('2016-08-23 00:00:00'),
  104569: Timestamp('2017-05-31 00:00:00'),
  104570: Timestamp('2018-06-06 00:00:00'),
  104571: Timestamp('2018-06-27 00:00:00'),
  104572: Timestamp('2018-11-22 00:00:00'),
  104573: Timestamp('2019-02-14 00:00:00'),
  104574: Timestamp('2019-06-18 00:00:00')},
 'Code': {104565: 2,
  104566: 10,
  104567: 2,
  104568: 10,
  104569: 2,
  104570: 2,
  104571: 2,
  104572: 2,
  104573: 10,
  104574: 10},
 'Result': {104565: <NA>,
  104566: 71,
  104567: <NA>,
  104568: 79,
  104569: <NA>,
  104570: <NA>,
  104571: <NA>,
  104572: <NA>,
  104573: 84,
  104574: 0},
 'DIM': {104565: 105,
  104566: 176,
  104567: 77,
  104568: 156,
  104569: 79,
  104570: 93,
  104571: 114,
  104572: 262,
  104573: 346,
  104574: 470},
 'AbCode': {104565: 0,
  104566: 0,
  104567: 0,
  104568: 0,
  104569: 0,
  104570: 0,
  104571: 0,
  104572: 0,
  104573: 0,
  104574: 0}}

Formatted the data looks like

         NID        Fdat      BCode Breddat   Code  Result  DIM Result  AbCode
104565  213003580   2015-02-24  0   2015-06-09  2   <NA>    105 <NA>    0
104566  213003580   2015-02-24  1   2015-08-19  10  71      176  71     0
104567  213003580   2016-03-20  0   2016-06-05  2   <NA>    77  <NA>    0
104568  213003580   2016-03-20  1   2016-08-23  10  79      156  79     0
104569  213003580   2017-03-13  0   2017-05-31  2   <NA>    79  <NA>    0
104570  213003580   2018-03-05  0   2018-06-06  2   <NA>    93  <NA>    0
104571  213003580   2018-03-05  0   2018-06-27  2   <NA>    114 <NA>    0
104572  213003580   2018-03-05  0   2018-11-22  2   <NA>    262 <NA>    0
104573  213003580   2018-03-05  1   2019-02-14  10  84      346  84     0
104574  213003580   2018-03-05  1   2019-06-18  10  0       470  0      0

The code to extract the row I want is as follows

out1 = df[((df.duplicated(['NID','Fdat','BCode'], keep='last') & (df['BCode'].eq(1)) & (df['Result'].ge(0))) )].copy()

This returns the second last line in this example and all others I need in the complete dataset

104573  213003580   2018-03-05  1   2019-02-14  10  84      346  84     0

The significance of this line is that the cow aborted and I would like to toggle column AbCode from 0 to 1

I have tried the following unsuccessfully

df['AbCode'] = np.where([((df.duplicated(['NID','Fdat','BCode'], keep='last') & (df['BCode'].eq(1)) & (df['Result'].ge(0))) )], 1,0)

This gives an error

ValueError: Length of values (1) does not match length of index (121627)

Is there a simple way to toggle the value of AbCode according to these conditions?

CodePudding user response:

numpy.where() accepts an array_like boolean condition. Result of (df.duplicated(['NID','Fdat','BCode'], keep='last') & (df['BCode'].eq(1)) & (df['Result'].ge(0))) is already a list, you don't need to wrap it into another list.

df['AbCode'] = np.where((df.duplicated(['NID','Fdat','BCode'], keep='last') & (df['BCode'].eq(1)) & (df['Result'].ge(0))) , 1, 0)
  • Related