I have a DataFrame like as shown below
ID_1,time_1,time_2
1,21,0
1,31,5
1,0,0
1,21,100
1,21,21
2,202,0
2,310,
2,0,0
2,201,
2,210,
2,178,190
I would like to fetch the latest value which is from time_2
column.
However, whenever time_2
column has zero or empty, I would like to pick the value from time_1
column.
If both time_1
and time_2
are zero, then we put just 0.
I was trying something like below
tdf['latest_value'] = tdf['time_2']
tdf['time_2'] = np.where((tdf['time_2']==0 | tdf['time_2'].isna()==True),tdf['time_1'],tdf['time_2'])
I expect my output to be like as shown below
CodePudding user response:
Replace 0
values to missing values with replace missing values by another column:
tdf['latest_value'] = tdf['time_2'].replace(0, np.nan).fillna(tdf['time_1'])
print (tdf)
ID_1 time_1 time_2 latest_value
0 1 21 0.0 21.0
1 1 31 5.0 5.0
2 1 0 0.0 0.0
3 1 21 100.0 100.0
4 1 21 21.0 21.0
5 2 202 0.0 202.0
6 2 310 NaN 310.0
7 2 0 0.0 0.0
8 2 201 NaN 201.0
9 2 210 NaN 210.0
10 2 178 190.0 190.0
Or if possible many columns first replace, forward filling missing values and select last column with replace missing values to 0
:
c = ['time_1', 'time_2']
tdf['latest_value'] = tdf[c].replace(0, np.nan).ffill(axis=1).iloc[:, -1].fillna(0)
print (tdf)
ID_1 time_1 time_2 latest_value
0 1 21 0.0 21.0
1 1 31 5.0 5.0
2 1 0 0.0 0.0
3 1 21 100.0 100.0
4 1 21 21.0 21.0
5 2 202 0.0 202.0
6 2 310 NaN 310.0
7 2 0 0.0 0.0
8 2 201 NaN 201.0
9 2 210 NaN 210.0
10 2 178 190.0 190.0
CodePudding user response:
You can use mask
and fillna
(or combine_first
):
tdf['latest_value']=tdf['time_2'].mask(df['time_2'] == 0).fillna(tdf['time_1'])
# OR
tdf['latest_value']=tdf['time_2'].mask(df['time_2'] == 0).combine_first(df['time_1'])
Output:
ID_1 time_1 time_2 latest_value
0 1 21 0.0 21.0
1 1 31 5.0 5.0
2 1 0 0.0 0.0
3 1 21 100.0 100.0
4 1 21 21.0 21.0
5 2 202 0.0 202.0
6 2 310 NaN 310.0
7 2 0 0.0 0.0
8 2 201 NaN 201.0
9 2 210 NaN 210.0
10 2 178 190.0 190.0
CodePudding user response:
numpy.where
works here too. In your code, you're just missing a parenthesis since |
has higher precedence than ==
:
tdf['latest_value'] = np.where((tdf['time_2']==0) | tdf['time_2'].isna(), tdf['time_1'], tdf['time_2'])
Output:
ID_1 time_1 time_2 latest_value
0 1 21 0.0 21.0
1 1 31 5.0 5.0
2 1 0 0.0 0.0
3 1 21 100.0 100.0
4 1 21 21.0 21.0
5 2 202 0.0 202.0
6 2 310 NaN 310.0
7 2 0 0.0 0.0
8 2 201 NaN 201.0
9 2 210 NaN 210.0
10 2 178 190.0 190.0