Home > Net >  use pandas to pick latest value from time based columns
use pandas to pick latest value from time based columns

Time:03-11

I have a DataFrame like as shown below

ID_1,time_1,time_2       
1,21,0
1,31,5
1,0,0
1,21,100
1,21,21
2,202,0
2,310,
2,0,0
2,201,
2,210,
2,178,190

I would like to fetch the latest value which is from time_2 column.

However, whenever time_2 column has zero or empty, I would like to pick the value from time_1 column.

If both time_1 and time_2 are zero, then we put just 0.

I was trying something like below

tdf['latest_value'] = tdf['time_2']
tdf['time_2'] = np.where((tdf['time_2']==0 | tdf['time_2'].isna()==True),tdf['time_1'],tdf['time_2'])

I expect my output to be like as shown below

enter image description here

CodePudding user response:

Replace 0 values to missing values with replace missing values by another column:

tdf['latest_value'] = tdf['time_2'].replace(0, np.nan).fillna(tdf['time_1'])
print (tdf)
    ID_1  time_1  time_2  latest_value
0      1      21     0.0          21.0
1      1      31     5.0           5.0
2      1       0     0.0           0.0
3      1      21   100.0         100.0
4      1      21    21.0          21.0
5      2     202     0.0         202.0
6      2     310     NaN         310.0
7      2       0     0.0           0.0
8      2     201     NaN         201.0
9      2     210     NaN         210.0
10     2     178   190.0         190.0

Or if possible many columns first replace, forward filling missing values and select last column with replace missing values to 0:

c = ['time_1', 'time_2']
tdf['latest_value'] = tdf[c].replace(0, np.nan).ffill(axis=1).iloc[:, -1].fillna(0)
print (tdf)
    ID_1  time_1  time_2  latest_value
0      1      21     0.0          21.0
1      1      31     5.0           5.0
2      1       0     0.0           0.0
3      1      21   100.0         100.0
4      1      21    21.0          21.0
5      2     202     0.0         202.0
6      2     310     NaN         310.0
7      2       0     0.0           0.0
8      2     201     NaN         201.0
9      2     210     NaN         210.0
10     2     178   190.0         190.0

CodePudding user response:

You can use mask and fillna (or combine_first):

tdf['latest_value']=tdf['time_2'].mask(df['time_2'] == 0).fillna(tdf['time_1'])

# OR

tdf['latest_value']=tdf['time_2'].mask(df['time_2'] == 0).combine_first(df['time_1'])

Output:

    ID_1  time_1  time_2  latest_value
0      1      21     0.0          21.0
1      1      31     5.0           5.0
2      1       0     0.0           0.0
3      1      21   100.0         100.0
4      1      21    21.0          21.0
5      2     202     0.0         202.0
6      2     310     NaN         310.0
7      2       0     0.0           0.0
8      2     201     NaN         201.0
9      2     210     NaN         210.0
10     2     178   190.0         190.0

CodePudding user response:

numpy.where works here too. In your code, you're just missing a parenthesis since | has higher precedence than ==:

tdf['latest_value'] = np.where((tdf['time_2']==0) | tdf['time_2'].isna(), tdf['time_1'], tdf['time_2'])

Output:

    ID_1  time_1  time_2  latest_value
0      1      21     0.0          21.0
1      1      31     5.0           5.0
2      1       0     0.0           0.0
3      1      21   100.0         100.0
4      1      21    21.0          21.0
5      2     202     0.0         202.0
6      2     310     NaN         310.0
7      2       0     0.0           0.0
8      2     201     NaN         201.0
9      2     210     NaN         210.0
10     2     178   190.0         190.0
  • Related