How to select the first valid rows in a pandas dataframe?

Time:08-19

I have a pd.DataFrame with a time series as its index:

                               a      b
2018-01-02 12:30:00 00:00    NaN    NaN
2018-01-02 13:45:00 00:00    NaN  232.0
2018-01-02 14:00:00 00:00  133.0  133.0
2018-01-02 14:15:00 00:00  134.0  134.0

I want to keep only the first non-NaN value of each column and set every other element to NaN:

                               a      b
2018-01-02 12:30:00 00:00    NaN    NaN
2018-01-02 13:45:00 00:00    NaN  232.0
2018-01-02 14:00:00 00:00  133.0    NaN
2018-01-02 14:15:00 00:00    NaN    NaN

Does pandas/numpy have an operation to achieve this in a vectorized way (without writing for loops)?

CodePudding user response:

You can apply Series.first_valid_index to each column and mask every other row with NaN:

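import numpy as np

# The mask is True everywhere except at each column's first valid index; True cells become NaN
df[df.apply(lambda col: col.index != col.first_valid_index())] = np.nan
print(df)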

                               a      b
2018-01-02 12:30:00 00:00    NaN    NaN
2018-01-02 13:45:00 00:00    NaN  232.0
2018-01-02 14:00:00 00:00  133.0    NaN
2018-01-02 14:15:00 00:00    NaN    NaN
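
To try this without overwriting df, here is a minimal sketch that rebuilds the sample frame and applies the same first_valid_index idea column by column. The UTC timezone on the index is an assumption, and the helper name keep_first_valid is hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the UTC timezone is an assumption.
idx = pd.to_datetime(
    ["2018-01-02 12:30", "2018-01-02 13:45", "2018-01-02 14:00", "2018-01-02 14:15"],
    utc=True,
)
df = pd.DataFrame(
    {"a": [np.nan, np.nan, 133.0, 134.0], "b": [np.nan, 232.0, 133.0, 134.0]},
    index=idx,
)


def keep_first_valid(col: pd.Series) -> pd.Series:
    """Return a copy of col where only its first non-NaN value is kept."""
    first = col.first_valid_index()
    if first is None:
        # The column has no valid value, so there is nothing to keep.
        return col.copy()
    # Keep the row whose label equals the first valid index, NaN elsewhere.
    return col.where(col.index == first)


out = df.apply(keep_first_valid)  # df itself is left unchanged
print(out)
```

Because the comparison is on index labels, this sketch (like the one-liner above) assumes the index has no duplicate timestamps; with duplicates, every row sharing the first valid label would be kept.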

CodePudding user response:

Using boolean masking:

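m1 = df.isna().cummin()         # True for the leading NaNs, before the first non-NA
m2 = m1.shift(fill_value=True)  # m1 moved down one row; fill_value=True keeps a non-NA value in the very first row

out = df.where(m2 & ~m1)        # True only at the first non-NA of each column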

output:

                               a      b
2018-01-02 12:30:00 00:00    NaN    NaN
2018-01-02 13:45:00 00:00    NaN  232.0
2018-01-02 14:00:00 00:00  133.0    NaN
2018-01-02 14:15:00 00:00    NaN    NaN
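
A variant of the same masking idea, offered as a sketch rather than as this answer's own method, counts valid values cumulatively instead of shifting: the first non-NA entry is the one valid cell where the running count equals 1. Columns c and d below are hypothetical additions to exercise two edge cases, and the UTC index is an assumption.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(
    ["2018-01-02 12:30", "2018-01-02 13:45", "2018-01-02 14:00", "2018-01-02 14:15"],
    utc=True,  # assumed timezone
)
# Columns a and b follow the question; c and d are hypothetical edge cases.
df = pd.DataFrame(
    {
        "a": [np.nan, np.nan, 133.0, 134.0],
        "b": [np.nan, 232.0, 133.0, 134.0],
        "c": [1.0, np.nan, 2.0, np.nan],  # valid value in the very first row
        "d": [np.nan] * 4,                # no valid value at all
    },
    index=idx,
)

valid = df.notna()
# Running count of valid values per column. It reaches 1 at the first valid
# entry and stays at 1 across any NaNs that follow, hence the extra `valid &`.
first = valid & (valid.cumsum() == 1)

out = df.where(first)
print(out)
```

This works on cell positions rather than index labels, so duplicate timestamps in the index do not affect it.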