Iterating over every datetime row to find the previous datetime equal to or nearest to a specified time interval

Problem: iterate over each row of 'Datetime' and find the previous datetime (in the same column) that is at least 20 seconds earlier. For example, if the previous datetime (row i-1) is 20 s or more before the current one, it is chosen; if not (say, only 5 s earlier), move on to row i-2, and so on, until the condition is met (NaN at the beginning, since there is no time before it). Once the previous datetime is found, it is put into a new column, and new columns are also created for the other values in the row with that previous time.

Example dataframe result (screenshot in the original post): the last three columns are the newly created columns, while the first three columns are the original dataset.

I'm new to Python, please help me.

CodePudding user response:

IIUC, you can do it in a vectorized (fast) way, using a rolling window on column Datetime based on a time offset:

# 1. find the first row of rolling windows of 20s, left-open
# then subtract 1 (to get first row at 20s or more away from
# the current one)

z = (
    df.assign(rownum=range(len(df)))
    .rolling(pd.Timedelta('20s'), on='Datetime', closed='right')['rownum']
    .apply(min).astype(int) - 1
)

# 2. take the rows indicated by z; make sure rows where z < 0
# (no valid previous row -- iloc would wrap around to the end)
# are replaced by NaN (or NaT for Datetime), amend the column
# names and concat

cols = ['Datetime', 'A', 'B']
newcols = [f'{k}_nearest' for k in cols]
out = pd.concat([
    df[cols],
    df.iloc[z][cols].set_axis(newcols, axis=1).reset_index(drop=True).where(z >= 0)
], axis=1)

On your data:

>>> z
0   -1
1    0
2    0
3    2
4    2
5    4

>>> out
                   Datetime   A   B          Datetime_nearest  A_nearest  B_nearest
0 2016-05-15 08:36:21+06:00  21   3                       NaT        NaN        NaN
1 2016-05-15 08:36:41+06:00  43   3 2016-05-15 08:36:21+06:00       21.0        3.0
2 2016-05-15 08:36:50+06:00  54  45 2016-05-15 08:36:21+06:00       21.0        3.0
3 2016-05-15 08:37:10+06:00   2  23 2016-05-15 08:36:50+06:00       54.0       45.0
4 2016-05-15 08:37:19+06:00  54   8 2016-05-15 08:36:50+06:00       54.0       45.0
5 2016-05-15 08:37:39+06:00  67   6 2016-05-15 08:37:19+06:00       54.0        8.0

Testing

# 1. make a function with the solution above
def select(df, min_dt='20s'):
    newcols = [f'{k}_nearest' for k in df.columns]
    z = (
        df.assign(rownum=range(len(df)))
        .rolling(pd.Timedelta(min_dt), on='Datetime', closed='right')['rownum']
        .apply(min).astype(int) - 1
    )
    return pd.concat([
        df,
        df.iloc[z].set_axis(newcols, axis=1).reset_index(drop=True).where(z >= 0)
    ], axis=1)


# 2. make a generator of test data (with just the right columns)
import numpy as np

def gen(n):
    return pd.DataFrame({
        'Datetime': np.random.randint(1, 30, n).cumsum() * pd.Timedelta('1s') + pd.Timestamp('2020'),
        'A': np.random.randint(0, 100, n),
        'B': np.random.randint(0, 100, n),
    })

Example

np.random.seed(0)
df = gen(10)
>>> select(df)
             Datetime   A   B    Datetime_nearest  A_nearest  B_nearest
0 2020-01-01 00:00:13  21  87                 NaT        NaN        NaN
1 2020-01-01 00:00:29  36  46                 NaT        NaN        NaN
2 2020-01-01 00:00:51  87  88 2020-01-01 00:00:29       36.0       46.0
3 2020-01-01 00:00:52  70  81 2020-01-01 00:00:29       36.0       46.0
4 2020-01-01 00:00:56  88  37 2020-01-01 00:00:29       36.0       46.0
5 2020-01-01 00:01:24  88  25 2020-01-01 00:00:56       88.0       37.0
6 2020-01-01 00:01:28  12  77 2020-01-01 00:00:56       88.0       37.0
7 2020-01-01 00:01:36  58  72 2020-01-01 00:00:56       88.0       37.0
8 2020-01-01 00:01:46  65   9 2020-01-01 00:01:24       88.0       25.0
9 2020-01-01 00:02:06  39  20 2020-01-01 00:01:46       65.0        9.0
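
As an extra correctness check (not part of the original answer), we can compare select against a naive loop that follows the problem statement literally; select_naive below is a hypothetical helper written only for this test:

def select_naive(df, min_dt='20s'):
    delta = pd.Timedelta(min_dt)
    prev = []  # index of the nearest previous row >= min_dt away, or -1
    for i in range(len(df)):
        j = i - 1
        # walk backwards until a row at least `min_dt` earlier is found
        while j >= 0 and df['Datetime'].iloc[i] - df['Datetime'].iloc[j] < delta:
            j -= 1
        prev.append(j)
    picked = df.iloc[[max(j, 0) for j in prev]].add_suffix('_nearest')
    picked = picked.reset_index(drop=True).where(pd.Series(prev) >= 0, axis=0)
    return pd.concat([df, picked], axis=1)

np.random.seed(1)
df = gen(500)
assert select(df).equals(select_naive(df))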

Speed

df = gen(10_000)
%timeit select(df)

# 398 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Q & A

  • why is this not O[n^2]?

    rolling is O[n] (on a sorted index) because it maintains two pointers: one for the left edge of the window, one for the right. It simply loops over the right index (O[n]) and advances the left index only when needed (also O[n] in total); see the pure-Python sketch after this Q & A. Note that we use min as the function to apply just for convenience; we could instead use a function that returns the first value, but in practice that doesn't save much time.

  • why use an auxiliary variable rownum and not simply the first index of each window (a Timestamp)?

    That's because rolling is rather picky about which functions it can apply: it wants a numerical value (a float, really) and croaks if the applied function returns, for example, a Timestamp. In addition, it is helpful to have a simple row number, especially when several rows may share the same Datetime.

  • why are we subtracting 1 in the expression for z?

    That is because the problem statement is: find the previous row that is at least 20 s away. We solve that by taking a window that is left-open, right-closed (so a row exactly 20 s away falls outside the window), then subtracting 1 to get the first row that is at least 20 seconds away from the current one. Any negative values that appear indicate that no such row exists, and are replaced by NaN.
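
To make the two-pointer argument concrete, here is a minimal pure-Python sketch of the scan (window_starts is a name chosen for this illustration; it is not pandas' actual implementation):

# `times` is a sorted sequence of timestamps, `width` the window size.
# `left` only ever moves forward, so the whole scan is O[n].
def window_starts(times, width):
    left = 0
    starts = []
    for right in range(len(times)):
        # advance `left` while it falls outside the left-open window,
        # i.e. while times[right] - times[left] >= width
        while times[right] - times[left] >= width:
            left += 1
        starts.append(left)  # first row inside the window
    return starts            # starts[i] - 1 plays the role of z[i]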

CodePudding user response:

The following solution is memory-efficient, but it is not the fastest (because it iterates over rows).

The fully vectorized version (that I can think of) would be faster but it would use O(n^2) memory.
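
For reference, here is one way that fully vectorized variant could look (a sketch of the idea, not code from this answer): broadcasting builds an n × n comparison matrix, which is exactly where the O(n^2) memory goes.

import numpy as np
import pandas as pd

def select_vectorized(df, min_dt='20s'):
    # assumes df['Datetime'] is sorted ascending, like the rolling approach
    t = df['Datetime'].to_numpy()
    # n x n matrix of pairwise "is row j at least min_dt before row i"
    ok = (t[:, None] - t[None, :]) >= pd.Timedelta(min_dt).to_timedelta64()
    # with sorted times, the True entries of each row form a prefix, so
    # counting them gives the last qualifying row index (-1 if none)
    z = ok.sum(axis=1) - 1
    picked = df.iloc[np.maximum(z, 0)].add_suffix('_nearest').reset_index(drop=True)
    return pd.concat([df, picked.where(pd.Series(z >= 0), axis=0)], axis=1)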

Example dataframe:

timestamps = [pd.Timestamp('2016-01-01 00:00:00'),
              pd.Timestamp('2016-01-01 00:00:19'),
              pd.Timestamp('2016-01-01 00:00:20'),
              pd.Timestamp('2016-01-01 00:00:21'),
              pd.Timestamp('2016-01-01 00:00:50')]
df = pd.DataFrame({'Datetime': timestamps,
                   'A': np.arange(10, 15),
                   'B': np.arange(20, 25)})

             Datetime   A   B
0 2016-01-01 00:00:00  10  20
1 2016-01-01 00:00:19  11  21
2 2016-01-01 00:00:20  12  22
3 2016-01-01 00:00:21  13  23
4 2016-01-01 00:00:50  14  24

Solution:

times = df['Datetime'].to_numpy()  # it's convenient to have it as an `ndarray`
shifted_times = times - pd.Timedelta(20, unit='s')

  • useful is a list of "useful" indices of df - i.e. those where the appended values will NOT be nan:

useful = np.nonzero(shifted_times >= times[0])[0]
# useful == [2, 3, 4]

  • Truncate shifted_times from the beginning - to iterate through the useful elements only:

if len(useful) == 0:
    # all new columns will be `nan`s
    first_i = 0  # this value will never actually be used
    useful_shifted_times = np.array([], dtype=shifted_times.dtype)
else:
    first_i = useful[0]  # first_i == 2
    useful_shifted_times = shifted_times[first_i : ]

 

  • Find the corresponding index positions of df for each "useful" value.

    (these index positions are essentially the indices of times that are selected for each element of useful_shifted_times):

selected_indices = []

# Iterate through `useful_shifted_times` one by one:
# (`i` starts at `first_i`)
for i, shifted_time in enumerate(useful_shifted_times, first_i):
    selected_index = np.nonzero(times[: i] <= shifted_time)[0][-1]
    selected_indices.append(selected_index)

# selected_indices == [0, 0, 3]

 

  • Selected rows:
df_nearest = df.iloc[selected_indices].add_suffix('_nearest')

     Datetime_nearest  A_nearest  B_nearest
0 2016-01-01 00:00:00         10         20
0 2016-01-01 00:00:00         10         20
3 2016-01-01 00:00:21         13         23

 

  • Replace indices of df_nearest to match those of the corresponding rows of df.

    (basically, that is the last len(selected_indices) indices):

df_nearest.index = df.index[len(df) - len(selected_indices) : ]

     Datetime_nearest  A_nearest  B_nearest
2 2016-01-01 00:00:00         10         20
3 2016-01-01 00:00:00         10         20
4 2016-01-01 00:00:21         13         23

 

  • Append the selected rows to the original dataframe to get the final result:
new_df = df.join(df_nearest)

             Datetime   A   B     Datetime_nearest  A_nearest  B_nearest
0 2016-01-01 00:00:00  10  20                  NaT        nan        nan
1 2016-01-01 00:00:19  11  21                  NaT        nan        nan
2 2016-01-01 00:00:20  12  22  2016-01-01 00:00:00         10         20
3 2016-01-01 00:00:21  13  23  2016-01-01 00:00:00         10         20
4 2016-01-01 00:00:50  14  24  2016-01-01 00:00:21         13         23

 


Note: NaT stands for 'Not a Time'. It is the equivalent of nan for time values.

Note: it also works as expected even when the last 'Datetime' minus 20 seconds falls before the very first 'Datetime' -- in that case all new columns will be nans.
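
For convenience, here is the same step-by-step logic gathered into one function (a direct assembly of the snippets above; append_nearest is just a name chosen here, and it assumes a sorted 'Datetime' column like the rest of this answer):

def append_nearest(df, seconds=20):
    times = df['Datetime'].to_numpy()
    shifted_times = times - pd.Timedelta(seconds, unit='s')
    useful = np.nonzero(shifted_times >= times[0])[0]
    if len(useful) == 0:
        # even the last 'Datetime' minus `seconds` is before the very first
        # one: joining with an empty frame makes all new columns nans
        return df.join(df.iloc[[]].add_suffix('_nearest'))
    first_i = useful[0]
    selected_indices = []
    for i, shifted_time in enumerate(shifted_times[first_i:], first_i):
        selected_indices.append(np.nonzero(times[:i] <= shifted_time)[0][-1])
    df_nearest = df.iloc[selected_indices].add_suffix('_nearest')
    df_nearest.index = df.index[len(df) - len(selected_indices):]
    return df.join(df_nearest)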
