Home > Software engineering >  fetch the first nonzero entry for each column and record the corresponding index value
fetch the first nonzero entry for each column and record the corresponding index value

Time:11-20

I have a dataframe that looks something like:

IndexMonth   Cus1   Cus2  Cus3  Cus4 ........ Cusn
2019-01       0     111    0      0           333
2019-02       0     111    0     666           0
2019-03      500     0     333    55           0  
2019-04      600     0     333    111          0
2019-05      600    100    0      111          0

I am looking to fetch the first non zero month for each Cus column, and also the last non-zero month. If the Cus has a break and it starts again, i want to have the new start also added in the start column. So my output should look something like :

       StartMonth          EndMonth
Cus1   2019-03             2019-05
Cus2   2019-01,2019-05     2019-02,2019-05
Cus3   2019-03             2019-04
Cus4   2019-02             2019-05
..
Cusn   2019-01             2019-01

Could you please confirm how efficiently i can achieve this.

CodePudding user response:

You can use masks to keep the first/last date per succession of non-zeros, then aggregate:

df2 = df.set_index('IndexMonth')
m = df2.ne(0)

start = (df2
    .where(m&~m.shift(fill_value=False))
    .stack()
    .reset_index('IndexMonth')
    .groupby(level=0)['IndexMonth']
    .agg(','.join)
    .rename('StartMonth')
 )

end = (df2
    .where(m&~m.shift(-1, fill_value=False))
    .stack()
    .reset_index('IndexMonth')
    .groupby(level=0)['IndexMonth']
    .agg(','.join)
    .rename('EndMonth')
 )

out = pd.concat([start, end], axis=1)

print(out)

Output:

           StartMonth         EndMonth
Cus1          2019-03          2019-05
Cus2  2019-01,2019-05  2019-02,2019-05
Cus3          2019-03          2019-04
Cus4          2019-02          2019-05
Cusn          2019-01          2019-01

CodePudding user response:

First i use the transpose function then i get the first value not equal to 0 with idxmax. I get the last value with last_valid_index

df2=df.T
df2['first']=df2.ne(0).idxmax(axis=1)
df2['last']=df2[df2.columns[0:5]].mask(df2==0).apply(pd.Series.last_valid_index, axis=1)
#You should set the numbers 0 and 5 according to the number of columns. Here I am only getting the month columns.

print(df2)
'''
IndexMonth  2019-01  2019-02  2019-03  2019-04  2019-05    first     last
Cus1              0        0      500      600      600  2019-03  2019-05
Cus2            111      111        0        0      100  2019-01  2019-05
Cus3              0        0      333      333        0  2019-03  2019-04
Cus4              0      666       55      111      111  2019-02  2019-05
Cusn            333        0        0        0        0  2019-01  2019-01
'''

First step is ok. For the second phase, I tried something like this, it works with this data, but I'm not sure if it will work correctly when the number of months changes:

df2['check']=False
for i in range (0,len(df2.index)):
    col_name=df2['first'][i] #which columns is the first ?
    if len(df2.iloc[i].loc[lambda x : x == df2[col_name][i]]) >= 2: #if there is more than one of the same value

        df2['check'][i]=True #return true and fill the first first value with 0. To be able to get the latter when using idxmax.

        df2[col_name][i]=0
df2['first_2']=df2.ne(0).idxmax(axis=1)
df2['is_combine']=(df2['check']==True) & (df2['last'] != df2['first_2'])
df2['StartMonth']=np.where(df2['is_combine']==True,(df2['first']  ', ' df2['last']),df2['first'])
df2['EndMonth']=np.where(df2['is_combine']==True,(df2['first_2']  ', ' df2['last']),df2['last'])
df2=df2[['StartMonth','EndMonth']]
print(df2)
'''
IndexMonth        StartMonth          EndMonth
Cus1                 2019-03           2019-05
Cus2        2019-01, 2019-05  2019-02, 2019-05
Cus3                 2019-03           2019-04
Cus4                 2019-02           2019-05
Cusn                 2019-01           2019-01
'''
  • Related