I have a pd.DataFrame with a time series as its index:
                               a      b
2018-01-02 12:30:00+00:00    NaN    NaN
2018-01-02 13:45:00+00:00    NaN  232.0
2018-01-02 14:00:00+00:00  133.0  133.0
2018-01-02 14:15:00+00:00  134.0  134.0
I want to keep only the first non-NaN value of each column; every other element should become NaN:
                               a      b
2018-01-02 12:30:00+00:00    NaN    NaN
2018-01-02 13:45:00+00:00    NaN  232.0
2018-01-02 14:00:00+00:00  133.0    NaN
2018-01-02 14:15:00+00:00    NaN    NaN
Does pandas/numpy have an operation to achieve this in a vectorized way (without writing for loops)?
CodePudding user response:
You can apply Series.first_valid_index to each column and mask every other row with NaN:
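# Return a Series from the lambda so the mask keeps df's index (assumes unique index labels)
df[df.apply(lambda col: pd.Series(col.index != col.first_valid_index(), index=col.index))] = np.nan
print(df)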
                               a      b
2018-01-02 12:30:00+00:00    NaN    NaN
2018-01-02 13:45:00+00:00    NaN  232.0
2018-01-02 14:00:00+00:00  133.0    NaN
2018-01-02 14:15:00+00:00    NaN    NaN
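For anyone who wants to run the first_valid_index idea end to end, here is a minimal sketch. The frame construction is my assumption about the question's data (a UTC DatetimeIndex), and `keep_first_valid` is a hypothetical helper name, not a pandas function. It uses `DataFrame.where` rather than in-place assignment, and the lambda returns a Series so the mask carries the same index as the frame.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data (UTC timestamps as index).
idx = pd.to_datetime(
    ["2018-01-02 12:30", "2018-01-02 13:45", "2018-01-02 14:00", "2018-01-02 14:15"],
    utc=True,
)
df = pd.DataFrame(
    {"a": [np.nan, np.nan, 133.0, 134.0], "b": [np.nan, 232.0, 133.0, 134.0]},
    index=idx,
)


def keep_first_valid(frame):
    """Hypothetical helper: keep each column's first non-NaN value, NaN elsewhere."""
    # For each column, mark the row whose label is the first valid index.
    keep = frame.apply(
        lambda col: pd.Series(col.index == col.first_valid_index(), index=col.index)
    )
    # where() keeps values where the mask is True and writes NaN everywhere else.
    return frame.where(keep)


result = keep_first_valid(df)
print(result)
```

Because the mask is built by comparing index labels, this sketch assumes the index is unique; with duplicated timestamps, every row sharing the first valid label would be kept.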
CodePudding user response:
Using boolean masking:
m1 = df.isna().cummin() # True for the leading NAs, before the first non-NA
m2 = m1.shift(fill_value=True) # m1 shifted down one row; True fill keeps a non-NA first row
out = df.where(m2&~m1) # True only at the first non-NA of each column
Output:
                               a      b
2018-01-02 12:30:00+00:00    NaN    NaN
2018-01-02 13:45:00+00:00    NaN  232.0
2018-01-02 14:00:00+00:00  133.0    NaN
2018-01-02 14:15:00+00:00    NaN    NaN
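A self-contained sketch of the masking idea, for checking it on your own data. The sample frame is my reconstruction of the question's data, and column `c` is a hypothetical extra column whose first row is already non-NaN. The shift here uses `fill_value=True`, so the row "before" the first one counts as NaN and such a column keeps its first value. The `cumsum` cross-check at the end is my own addition, not part of the answers.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data, plus a hypothetical column "c"
# that is already valid in the first row.
idx = pd.to_datetime(
    ["2018-01-02 12:30", "2018-01-02 13:45", "2018-01-02 14:00", "2018-01-02 14:15"],
    utc=True,
)
df = pd.DataFrame(
    {
        "a": [np.nan, np.nan, 133.0, 134.0],
        "b": [np.nan, 232.0, 133.0, 134.0],
        "c": [1.0, np.nan, 3.0, 4.0],
    },
    index=idx,
)

# True while only NaNs have been seen so far in the column.
leading_na = df.isna().cummin()
# The same mask one row later; the True fill treats the row before the first as NaN.
prev_leading_na = leading_na.shift(fill_value=True)
# A cell is the first valid one when the previous row was still in the leading-NaN
# run and the current row is not.
out = df.where(prev_leading_na & ~leading_na)

# Cross-check with a different formulation: a valid cell whose running count
# of valid cells is exactly 1.
valid = df.notna()
alt = df.where(valid & valid.cumsum().eq(1))

print(out)
```

Both versions work purely on positions within each column, so they do not depend on the index being unique.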