I have a dataframe that looks something like:
IndexMonth Cus1 Cus2 Cus3 Cus4 ........ Cusn
2019-01 0 111 0 0 333
2019-02 0 111 0 666 0
2019-03 500 0 333 55 0
2019-04 600 0 333 111 0
2019-05 600 100 0 111 0
I am looking to fetch the first non zero month for each Cus column, and also the last non-zero month. If the Cus has a break and it starts again, i want to have the new start also added in the start column. So my output should look something like :
StartMonth EndMonth
Cus1 2019-03 2019-05
Cus2 2019-01,2019-05 2019-02,2019-05
Cus3 2019-03 2019-04
Cus4 2019-02 2019-05
..
Cusn 2019-01 2019-01
Could you please confirm how efficiently i can achieve this.
CodePudding user response:
You can use masks to keep the first/last date per succession of non-zeros, then aggregate:
df2 = df.set_index('IndexMonth')
m = df2.ne(0)
start = (df2
.where(m&~m.shift(fill_value=False))
.stack()
.reset_index('IndexMonth')
.groupby(level=0)['IndexMonth']
.agg(','.join)
.rename('StartMonth')
)
end = (df2
.where(m&~m.shift(-1, fill_value=False))
.stack()
.reset_index('IndexMonth')
.groupby(level=0)['IndexMonth']
.agg(','.join)
.rename('EndMonth')
)
out = pd.concat([start, end], axis=1)
print(out)
Output:
StartMonth EndMonth
Cus1 2019-03 2019-05
Cus2 2019-01,2019-05 2019-02,2019-05
Cus3 2019-03 2019-04
Cus4 2019-02 2019-05
Cusn 2019-01 2019-01
CodePudding user response:
First i use the transpose function then i get the first value not equal to 0 with idxmax. I get the last value with last_valid_index
df2=df.T
df2['first']=df2.ne(0).idxmax(axis=1)
df2['last']=df2[df2.columns[0:5]].mask(df2==0).apply(pd.Series.last_valid_index, axis=1)
#You should set the numbers 0 and 5 according to the number of columns. Here I am only getting the month columns.
print(df2)
'''
IndexMonth 2019-01 2019-02 2019-03 2019-04 2019-05 first last
Cus1 0 0 500 600 600 2019-03 2019-05
Cus2 111 111 0 0 100 2019-01 2019-05
Cus3 0 0 333 333 0 2019-03 2019-04
Cus4 0 666 55 111 111 2019-02 2019-05
Cusn 333 0 0 0 0 2019-01 2019-01
'''
First step is ok. For the second phase, I tried something like this, it works with this data, but I'm not sure if it will work correctly when the number of months changes:
df2['check']=False
for i in range (0,len(df2.index)):
col_name=df2['first'][i] #which columns is the first ?
if len(df2.iloc[i].loc[lambda x : x == df2[col_name][i]]) >= 2: #if there is more than one of the same value
df2['check'][i]=True #return true and fill the first first value with 0. To be able to get the latter when using idxmax.
df2[col_name][i]=0
df2['first_2']=df2.ne(0).idxmax(axis=1)
df2['is_combine']=(df2['check']==True) & (df2['last'] != df2['first_2'])
df2['StartMonth']=np.where(df2['is_combine']==True,(df2['first'] ', ' df2['last']),df2['first'])
df2['EndMonth']=np.where(df2['is_combine']==True,(df2['first_2'] ', ' df2['last']),df2['last'])
df2=df2[['StartMonth','EndMonth']]
print(df2)
'''
IndexMonth StartMonth EndMonth
Cus1 2019-03 2019-05
Cus2 2019-01, 2019-05 2019-02, 2019-05
Cus3 2019-03 2019-04
Cus4 2019-02 2019-05
Cusn 2019-01 2019-01
'''