Home > front end >  Select first row with a range (every 10 minutes)
Select first row with a range (every 10 minutes)

Time:08-24

I have a dataframe like this:

df = pd.DataFrame({"DateTime":["2020-04-02 06:06:22",
                               "2020-04-02 06:12:22",
                               "2020-04-02 06:14:39",
                               "2020-04-02 06:16:56",
                               "2020-04-02 06:20:34",
                               "2020-04-02 06:35:44"],
                   "Data":[23, 31, 10, 23, 56, 81]})

# column DateTime type must be datetime64[ns]
df["DateTime"] = df["DateTime"].astype("datetime64[ns]")

df
Out[4]: 
             DateTime  Data
0 2020-04-02 06:06:22    23
1 2020-04-02 06:12:22    31
2 2020-04-02 06:14:39    10
3 2020-04-02 06:16:56    23
4 2020-04-02 06:20:34    56
5 2020-04-02 06:35:44    81

I would like to select rows after every 10 min. So my dataframe should be like:

             DateTime  Data
0 2020-04-02 06:06:22    23
3 2020-04-02 06:16:56    23
5 2020-04-02 06:35:44    81

This solution How to drop rows based on datetime (every 15 min)? drops rows every 15 min but always looking at the exactly row below, so it deletes rows that I don't want. And actually I would like to select rows after a specific time range.

Anyone could help me?

CodePudding user response:

You could count the time elapsed from your first row and then divide this elapsed time into 10 minute buckets using an integer division -

df['time_elapsed'] = (df['DateTime'] - df['DateTime'].shift(1)).fillna(pd.Timedelta(seconds=0)).cumsum()
df['ten_min_bucket'] = pd.to_datetime(df['time_elapsed']).dt.minute // 10
df.groupby('ten_min_bucket')[['DateTime', 'Data']].agg('first')

Output

                          DateTime  Data
ten_min_bucket                          
0              2020-04-02 06:06:22    23
1              2020-04-02 06:16:56    23
2              2020-04-02 06:35:44    81

CodePudding user response:

This looks like a job for merge_asof:

# set up indexer DataFrame
df2 = pd.DataFrame({'idx': pd.date_range(df['DateTime'].min(),
                                         df['DateTime'].max(),
                                         freq='10min')
                    })
# get first value for each slice of 10 minutes
out =  (pd.merge_asof(df2, df, left_on='idx', right_on='DateTime',
                     direction='forward')
          #.drop(columns='idx') # uncomment to remove idx
        )

output:

                  idx            DateTime  Data
0 2020-04-02 06:06:22 2020-04-02 06:06:22    23
1 2020-04-02 06:16:22 2020-04-02 06:16:56    23
2 2020-04-02 06:26:22 2020-04-02 06:35:44    81

output with .drop(columns='idx'):

             DateTime  Data
0 2020-04-02 06:06:22    23
1 2020-04-02 06:16:56    23
2 2020-04-02 06:35:44    81
  • Related