Existing Dataframe :
Id Month Year scheduled completed
A Jan 2021 0 0
A Feb 2021 1 0
A mar 2021 0 0
B June 2021 0 1
B July 2021 0 1
B Aug 2021 0 1
B Sep 2021 0 1
C Nov 2021 1 0
C Dec 2021 1 0
C Jan 2022 1 0
C Feb 2022 1 0
Expected Dataframe :
Id status
A defaulter
B non_defaulter
C defaulter
I am trying to create a status tag for each basis their activity. if for three consecutive Month if completed column remains 0 , that Id is to be tagged as "defaulter" else "non_defaulter"
CodePudding user response:
You can use a double groupby
to count the consecutive 1s in completed
, then to ensure there is at least 1 stretch greater or equal to N=3
:
N = 3
# is the row a zero?
m = df['completed'].eq(0)
# count the consecutive zeros
(m.groupby([df['Id'], (~m).cumsum()])
.sum().ge(N)
# check if there is at least one stretch of value >= N
.groupby(level=0).any()
# convert the True/False into strings
.map({False: 'non_defaulter', True: 'defaulter'})
.reset_index(name='status')
)
Output:
Id status
0 A defaulter
1 B non_defaulter
2 C defaulter
CodePudding user response:
Idea is aggregate per consecutive groups of 0
values per Id
and helper Series s
created by Series.cumsum
, count True
s values, then because possible multiple groups per Id
aggregate max
and last set values in numpy.where
:
N = 3
m = df['completed'].ne(0)
df = ((~m).groupby([df['Id'], m.cumsum().mask(m, -1)])
.sum()
.groupby(level=0)
.max()
.reset_index(name='status')
.assign(status = lambda x: np.where(x['status'].ge(N),
'defaulter','non_defaulter')))
print (df)
Id status
0 A defaulter
1 B non_defaulter
2 C defaulter