Pandas: groupby by date then return first valid value with matching datetime

Time:05-10

With:

import numpy as np
import pandas as pd

df = pd.DataFrame({'datetime': pd.date_range('2022-05-01 10:00:00', periods=10, freq='10H'),
                   'value': [np.nan, np.nan, np.nan, -0.61, np.nan, 0.55, 0.63, np.nan, 0.15, np.nan]})

df
             datetime  value
0 2022-05-01 10:00:00    NaN
1 2022-05-01 20:00:00    NaN
2 2022-05-02 06:00:00    NaN
3 2022-05-02 16:00:00  -0.61
4 2022-05-03 02:00:00    NaN
5 2022-05-03 12:00:00   0.55
6 2022-05-03 22:00:00   0.63
7 2022-05-04 08:00:00    NaN
8 2022-05-04 18:00:00   0.15
9 2022-05-05 04:00:00    NaN

How can I group by date and get the first valid (non-NaN) value of each day together with its corresponding datetime?

date                  datetime   value
2022-05-02  2022-05-02 16:00:00  -0.61
2022-05-03  2022-05-03 12:00:00   0.55
2022-05-04  2022-05-04 18:00:00   0.15

I tried df.groupby([df['datetime'].dt.date]).first(), but it gives me the following dataframe, in which the datetime is the first one on that day rather than the datetime of the returned value, which is what I need:

                      datetime  value
datetime                             
2022-05-01 2022-05-01 10:00:00    NaN
2022-05-02 2022-05-02 06:00:00  -0.61
2022-05-03 2022-05-03 02:00:00   0.55
2022-05-04 2022-05-04 08:00:00   0.15
2022-05-05 2022-05-05 04:00:00    NaN
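
The mismatch comes from GroupBy.first working column by column: it returns the first non-null entry of each column within a group, so datetime and value can come from different rows. A small sketch of that behaviour on the data above (the names day, per_column and row_wise are only illustrative, and the frequency is written as a Timedelta so it does not depend on the alias spelling of a particular pandas version):

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame({
    'datetime': pd.date_range('2022-05-01 10:00:00', periods=10,
                              freq=pd.Timedelta(hours=10)),
    'value': [np.nan, np.nan, np.nan, -0.61, np.nan, 0.55, 0.63, np.nan, 0.15, np.nan],
})

day = df['datetime'].dt.date

# first(): first non-null entry of each column, taken independently per column.
per_column = df.groupby(day).first()

# head(1): the actual first row of each day, NaN values included.
row_wise = df.groupby(day).head(1)

print(per_column)
print(row_wise)
```

For 2022-05-02, per_column pairs the 06:00 datetime with -0.61, while row_wise shows that the 06:00 row itself holds NaN.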

CodePudding user response:

Another way: keep only the rows where value is not NaN, then group by the date extracted from datetime and take the first row of each group:

df[df['value'].notna()].groupby(df['datetime'].dt.date).first()


                      datetime  value
datetime                             
2022-05-02 2022-05-02 16:00:00  -0.61
2022-05-03 2022-05-03 12:00:00   0.55
2022-05-04 2022-05-04 18:00:00   0.15

CodePudding user response:

Use DataFrame.dropna to remove the rows with a missing value before grouping:

df1 = df.dropna(subset=['value']).groupby(df['datetime'].dt.date).first()
print(df1)
                      datetime  value
datetime                             
2022-05-02 2022-05-02 16:00:00  -0.61
2022-05-03 2022-05-03 12:00:00   0.55
2022-05-04 2022-05-04 18:00:00   0.15
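
If the index should be named date, as in the layout requested in the question, one possible variant is to store the date in its own column before grouping. This is only a sketch; the column name date and the variable result are assumptions taken from the desired output, not from the answer above:

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame({
    'datetime': pd.date_range('2022-05-01 10:00:00', periods=10,
                              freq=pd.Timedelta(hours=10)),
    'value': [np.nan, np.nan, np.nan, -0.61, np.nan, 0.55, 0.63, np.nan, 0.15, np.nan],
})

result = (
    df.dropna(subset=['value'])                         # keep only rows with a valid value
      .assign(date=lambda x: x['datetime'].dt.date)     # explicit grouping column
      .groupby('date')
      .first()                                          # first remaining row of each day
)
print(result)
```

Because the NaN rows are gone before grouping, the first remaining row of each day is the first valid one, so datetime and value come from the same row.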

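If you also need to keep the days without a valid value (note that datetime is then the first timestamp of each day, not the one matching the value):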

d = df['datetime'].dt.date
# back-fill value within each day, then keep only the first row per day
df = df.groupby(d).bfill().set_index(d).loc[lambda x: ~x.index.duplicated()]
print(df)
                      datetime  value
datetime                             
2022-05-01 2022-05-01 10:00:00    NaN
2022-05-02 2022-05-02 06:00:00  -0.61
2022-05-03 2022-05-03 02:00:00   0.55
2022-05-04 2022-05-04 08:00:00   0.15
2022-05-05 2022-05-05 04:00:00    NaN
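
To keep every day and still get the datetime that belongs to the value, one option is to take the first valid row per day and then reindex over all dates present in the data. A minimal sketch, assuming that days without any valid value should show NaT and NaN; the names tagged, first_valid and result are illustrative only:

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame({
    'datetime': pd.date_range('2022-05-01 10:00:00', periods=10,
                              freq=pd.Timedelta(hours=10)),
    'value': [np.nan, np.nan, np.nan, -0.61, np.nan, 0.55, 0.63, np.nan, 0.15, np.nan],
})

# Store the calendar date in its own column so it survives the filtering step.
tagged = df.assign(date=df['datetime'].dt.date)

# First valid row per day: datetime and value come from the same row.
first_valid = tagged.dropna(subset=['value']).groupby('date').first()

# Reindex over every date in the data; days with no valid value become NaT / NaN.
all_days = pd.Index(tagged['date'].unique(), name='date')
result = first_valid.reindex(all_days)
print(result)
```

Here 2022-05-01 and 2022-05-05 stay in the result with missing entries, while 2022-05-02 is paired with 16:00, the row that actually holds -0.61.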