For example, I have a Pandas DataFrame df with two columns: ['number'] and ['date']. The ['number'] column holds various numbers; the ['date'] column holds the date and time when each number appeared, as an Epoch Unix timestamp.
How can I check whether a particular number, for instance 20, appeared more than 15 times within a period of 100 seconds? (The whole time range, df['date'][n] - df['date'][0], is much greater than 100 seconds.)
I wrote the Python code shown below. It works, but if the dataset is very big (millions of rows) and the event I am looking for occurs close to the end of the dataset, it takes a long time to get the result. How can I make it run faster?
count = 0
time = 0
length = len(df['number'])
for i in range(length):
    time = df['date'][i] + 100
    for j in range(1, length):
        if df['date'][j] > time:
            count = df['number'][(df['date'] >= df['date'][i]) & (df['date'] <= df['date'][j-1]) & (df['number'] == 20)].count()
            break
        if j == length - 1:
            count = df['number'][(df['date'] >= df['date'][i]) & (df['date'] <= df['date'][j]) & (df['number'] == 20)].count()
            break
    if count > 15:
        print('Number 20 appeared more than 15 times within the period of 100 seconds')
        break
Answer:
I don't follow the logic of your code. For a given index i, you set a time boundary of date[i] + 100, then look through the entire DataFrame for rows with a time greater than that boundary.
Anyway, I'm going from the text of your question instead. The following counts the maximum number of times each number appears within a given interval.
# reproducible setup
import numpy as np
import pandas as pd

np.random.seed(0)
n = 100
interval = pd.Timedelta(100, 's')
ratio = 10
df = pd.DataFrame({
    'date': pd.Series(
        np.random.uniform(0, n / ratio, n) * interval + pd.Timestamp('2000')
    ).dt.round('s'),
    'number': np.random.choice([0, 10, 20, 30], n),
}).set_index('date').sort_index()
The above builds a 100-row DataFrame with random dates and numbers, e.g.:
>>> df.head()
number
date
2000-01-01 00:00:05 10
2000-01-01 00:00:19 0
2000-01-01 00:00:20 20
2000-01-01 00:00:20 0
2000-01-01 00:00:39 0
Now, here is an expression that uses a RollingGroupBy to count, for each number and date, how many times the given number has been seen in the last 100 seconds:
z = df.assign(n=1).groupby('number')['n'].rolling(interval).sum().astype(int)
>>> z.head()
number date
0 2000-01-01 00:00:19 1
2000-01-01 00:00:20 2
2000-01-01 00:00:39 3
2000-01-01 00:01:27 4
2000-01-01 00:02:00 3
Name: n, dtype: int64
Then:
>>> z.groupby('number').max()
number
0 7
10 5
20 7
30 7
Name: n, dtype: int64
The above tells us that, for number 20, the maximum count of occurrences in a 100-second interval is 7, and:
>>> z.groupby('number').idxmax().str[-1]
number
0 2000-01-01 00:03:03
10 2000-01-01 00:10:18
20 2000-01-01 00:01:59
30 2000-01-01 00:13:58
it occurred on 2000-01-01 00:01:59.
Indeed:
tmax = z.groupby('number').idxmax().str[-1][20]
zz = df[df['number'] == 20].truncate(before=tmax - interval, after=tmax)
>>> zz
number
date
2000-01-01 00:00:20 20
2000-01-01 00:01:04 20
2000-01-01 00:01:11 20
2000-01-01 00:01:34 20
2000-01-01 00:01:37 20
2000-01-01 00:01:42 20
2000-01-01 00:01:59 20
>>> len(zz)
7
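To answer the original yes/no question directly, the per-number maxima can be compared against the threshold. A minimal self-contained sketch, repeating the reproducible setup from above:

```python
import numpy as np
import pandas as pd

# same reproducible setup as above
np.random.seed(0)
n = 100
interval = pd.Timedelta(100, 's')
ratio = 10
df = pd.DataFrame({
    'date': pd.Series(
        np.random.uniform(0, n / ratio, n) * interval + pd.Timestamp('2000')
    ).dt.round('s'),
    'number': np.random.choice([0, 10, 20, 30], n),
}).set_index('date').sort_index()

# max rolling-window count per number, as computed above
z = df.assign(n=1).groupby('number')['n'].rolling(interval).sum().astype(int)
max_counts = z.groupby('number').max()

# did number 20 ever appear more than 15 times within 100 seconds?
# (per the output above, its maximum is 7, so this is False here)
exceeded = max_counts.get(20, 0) > 15
```

This avoids the early-exit scan of the original code entirely: the rolling counts for all numbers are computed in one vectorized pass, and the threshold check is a cheap comparison at the end.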
Speed
To measure speed, try replacing in the setup, e.g.:
n = 1_000_000
And later:
%timeit df.assign(n=1).groupby('number')['n'].rolling(interval).sum().astype(int)
# 350 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, less than 1/2 second for 1 million rows.
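One last note: the question's dates are epoch Unix timestamps, not datetimes, so one conversion step is needed before the rolling-window expression can be applied. A sketch with made-up epoch values, assuming the timestamps are in seconds:

```python
import pandas as pd

# hypothetical frame in the question's format: epoch seconds in 'date'
df = pd.DataFrame({
    'number': [20, 20, 20, 5],
    'date': [1_600_000_000, 1_600_000_050, 1_600_000_090, 1_600_000_500],
})

# convert epoch seconds to a DatetimeIndex so the rolling-window
# expression above can be applied unchanged
df = (
    df.assign(date=pd.to_datetime(df['date'], unit='s'))
      .set_index('date')
      .sort_index()
)

# rolling 100-second counts per number, exactly as in the answer
z = df.assign(n=1).groupby('number')['n'].rolling('100s').sum().astype(int)
```

Here all three occurrences of 20 fall within one 100-second window, so its maximum rolling count is 3.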