For example, I have a Pandas DataFrame df with two columns: ['number'] and ['date']. The ['number'] column holds various numbers; the ['date'] column holds the date and time when each number appeared, as an Epoch Unix timestamp.
How can I check whether a particular number, for instance 20, appeared more than 15 times within a period of 100 seconds? (The whole time range, df['date'][n] - df['date'][0], is much greater than 100 seconds.)
I wrote the Python code shown below. It works, but if the dataset is very big (millions of rows) and the event I am looking for occurs close to the end of the dataset, it takes a long time to get the result. How can I make it run faster?
count = 0
time = 0
length = len(df['number'])
for i in range(length):
    time = df['date'][i] + 100
    for j in range(1, length):
        if df['date'][j] > time:
            count = df['number'][(df['date'] >= df['date'][i]) & (df['date'] <= df['date'][j-1]) & (df['number'] == 20)].count()
            break
        if j == length - 1:
            count = df['number'][(df['date'] >= df['date'][i]) & (df['date'] <= df['date'][j]) & (df['number'] == 20)].count()
            break
    if count > 15:
        print('Number 20 appeared more than 15 times within the period of 100 seconds')
        break
Answer:
I don't follow the logic of your code. For a given index i, you set a time boundary of date[i] + 100, then look through the entire DataFrame for rows with a time greater than that boundary.
Anyway, I'm going from the text of your question instead. The following counts the maximum number of times each number appears within a given interval.
# reproducible setup
import numpy as np
import pandas as pd

np.random.seed(0)
n = 100
interval = pd.Timedelta(100, 's')
ratio = 10
df = pd.DataFrame({
    'date': pd.Series(
        np.random.uniform(0, n / ratio, n) * interval + pd.Timestamp('2000')
    ).dt.round('s'),
    'number': np.random.choice([0, 10, 20, 30], n),
}).set_index('date').sort_index()
The above builds a 100-row DataFrame with random dates and numbers, e.g.:
>>> df.head()
number
date
2000-01-01 00:00:05 10
2000-01-01 00:00:19 0
2000-01-01 00:00:20 20
2000-01-01 00:00:20 0
2000-01-01 00:00:39 0
Now, here is an expression that uses a RollingGroupBy to count, for each number and date, how many times the given number has been seen in the last 100 seconds:
z = df.assign(n=1).groupby('number')['n'].rolling(interval).sum().astype(int)
>>> z.head()
number date
0 2000-01-01 00:00:19 1
2000-01-01 00:00:20 2
2000-01-01 00:00:39 3
2000-01-01 00:01:27 4
2000-01-01 00:02:00 3
Name: n, dtype: int64
Then:
>>> z.groupby('number').max()
number
0 7
10 5
20 7
30 7
Name: n, dtype: int64
The above tells us that, for number 20, the maximum count of occurrences in a 100-second interval is 7, and:
>>> z.groupby('number').idxmax().str[-1]
number
0 2000-01-01 00:03:03
10 2000-01-01 00:10:18
20 2000-01-01 00:01:59
30 2000-01-01 00:13:58
it occurred on 2000-01-01 00:01:59.
Indeed:
tmax = z.groupby('number').idxmax().str[-1][20]
zz = df[df['number'] == 20].truncate(before=tmax - interval, after=tmax)
>>> zz
number
date
2000-01-01 00:00:20 20
2000-01-01 00:01:04 20
2000-01-01 00:01:11 20
2000-01-01 00:01:34 20
2000-01-01 00:01:37 20
2000-01-01 00:01:42 20
2000-01-01 00:01:59 20
>>> len(zz)
7
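To answer the original yes/no question directly, the per-number maxima can be compared against the threshold. A minimal self-contained sketch, repeating the reproducible setup from above:

```python
import numpy as np
import pandas as pd

# same reproducible setup as above
np.random.seed(0)
n = 100
interval = pd.Timedelta(100, 's')
ratio = 10
df = pd.DataFrame({
    'date': pd.Series(
        np.random.uniform(0, n / ratio, n) * interval + pd.Timestamp('2000')
    ).dt.round('s'),
    'number': np.random.choice([0, 10, 20, 30], n),
}).set_index('date').sort_index()

# max rolling-window count per number, as computed above
z = df.assign(n=1).groupby('number')['n'].rolling(interval).sum().astype(int)
max_counts = z.groupby('number').max()

# did number 20 ever appear more than 15 times within 100 seconds?
# (per the output above, its maximum is 7, so this is False here)
exceeded = max_counts.get(20, 0) > 15
```

This avoids the early-exit scan of the original code entirely: the rolling counts for all numbers are computed in one vectorized pass, and the threshold check is a cheap comparison at the end.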
Speed
To measure speed, try replacing in the setup, e.g.:
n = 1_000_000
And later:
%timeit df.assign(n=1).groupby('number')['n'].rolling(interval).sum().astype(int)
# 350 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, less than 1/2 second for 1 million rows.
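One last note: the question's dates are epoch Unix timestamps, not datetimes, so one conversion step is needed before the rolling-window expression can be applied. A sketch with made-up epoch values, assuming the timestamps are in seconds:

```python
import pandas as pd

# hypothetical frame in the question's format: epoch seconds in 'date'
df = pd.DataFrame({
    'number': [20, 20, 20, 5],
    'date': [1_600_000_000, 1_600_000_050, 1_600_000_090, 1_600_000_500],
})

# convert epoch seconds to a DatetimeIndex so the rolling-window
# expression above can be applied unchanged
df = (
    df.assign(date=pd.to_datetime(df['date'], unit='s'))
      .set_index('date')
      .sort_index()
)

# rolling 100-second counts per number, exactly as in the answer
z = df.assign(n=1).groupby('number')['n'].rolling('100s').sum().astype(int)
```

Here all three occurrences of 20 fall within one 100-second window, so its maximum rolling count is 3.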