I have a series of data with dates and events that occurred on those dates (one line per event). I would like to know if a given event has occurred again two days after the first occurrence.
Example table:
date event
01/01/2000 a
01/01/2000 b
01/01/2000 c
01/01/2000 d
02/01/2000 e
02/01/2000 f
02/01/2000 g
02/01/2000 h
03/01/2000 i
03/01/2000 j
03/01/2000 k
03/01/2000 a
04/01/2000 l
04/01/2000 m
04/01/2000 b
04/01/2000 o
Here, event "a" occurred on 01/01/2000 and recurred two days later, on 03/01/2000. I would like to flag the 01/01/2000 row for "a" with TRUE. Here is the table I would like to get:
date event repeat_D2
01/01/2000 a TRUE
01/01/2000 b FALSE
01/01/2000 c FALSE
01/01/2000 d FALSE
02/01/2000 e FALSE
02/01/2000 f FALSE
02/01/2000 g FALSE
02/01/2000 h FALSE
03/01/2000 i FALSE
03/01/2000 j FALSE
03/01/2000 k FALSE
03/01/2000 a FALSE
04/01/2000 l FALSE
04/01/2000 m FALSE
04/01/2000 b FALSE
04/01/2000 o FALSE
The only solution I could think of is a "for" loop with an "is in" check, but that would mean scanning the whole table row by row to see whether the same event appears somewhere with a date equal to D+2. My real table is several hundred megabytes, so I doubt this is optimal.
How would you solve this problem?
CodePudding user response:
A common way to handle this type of problem is to join the table to itself on the event column and then check for rows where your condition is True.
Example
### Setup Start ###
import pandas as pd
events = [
('01/01/2000', 'a'),
('01/01/2000', 'b'),
('01/01/2000', 'c'),
('01/01/2000', 'd'),
('02/01/2000', 'e'),
('02/01/2000', 'f'),
('02/01/2000', 'g'),
('02/01/2000', 'h'),
('03/01/2000', 'i'),
('03/01/2000', 'j'),
('03/01/2000', 'k'),
('03/01/2000', 'a'),
('04/01/2000', 'l'),
('04/01/2000', 'm'),
('04/01/2000', 'b'),
('04/01/2000', 'o')
]
df = pd.DataFrame({'date': [x[0] for x in events], 'event': [x[1] for x in events]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
### Setup End ###
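# Self join df on event: each row is paired with every occurrence of the same event
dfj = df.join(df.set_index('event'), how='left', on='event', lsuffix='_original')
# Evaluate timedelta condition (vectorized): the paired occurrence is exactly 2 days later
dfj['repeat_D2'] = (dfj['date'] - dfj['date_original']).dt.days == 2
# Group by the original row index and assign: True if any pairing met the condition
df['repeat_D2'] = dfj.groupby(level=0)['repeat_D2'].any()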
print(df)
Output
index | date | event | repeat_D2 |
---|---|---|---|
0 | 2000-01-01 | a | True |
1 | 2000-01-01 | b | False |
2 | 2000-01-01 | c | False |
3 | 2000-01-01 | d | False |
4 | 2000-01-02 | e | False |
5 | 2000-01-02 | f | False |
6 | 2000-01-02 | g | False |
7 | 2000-01-02 | h | False |
8 | 2000-01-03 | i | False |
9 | 2000-01-03 | j | False |
10 | 2000-01-03 | k | False |
11 | 2000-01-03 | a | False |
12 | 2000-01-04 | l | False |
13 | 2000-01-04 | m | False |
14 | 2000-01-04 | b | False |
15 | 2000-01-04 | o | False |
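
Since the real table is several hundred megabytes, a self join on event alone may grow large when the same event recurs many times, because every occurrence is paired with every other one. A possible alternative, sketched below under the assumption that the dates carry no time component, is to shift the dates back by two days and merge on both date and event, so each row is matched at most once. The names `keys` and `flagged` are only illustrative, and the sample data is a shortened version of the table above.

```python
import pandas as pd

# Shortened sample of the question's table (assumption: dates have no time part)
df = pd.DataFrame(
    [("01/01/2000", "a"), ("01/01/2000", "b"), ("02/01/2000", "e"),
     ("03/01/2000", "a"), ("04/01/2000", "b")],
    columns=["date", "event"],
)
df["date"] = pd.to_datetime(df["date"], dayfirst=True)

# Distinct (date, event) pairs, moved back two days: a row of df that matches
# one of these keys has the same event occurring exactly two days later.
keys = df[["date", "event"]].drop_duplicates()
keys["date"] = keys["date"] - pd.Timedelta(days=2)

# Left merge keeps one output row per row of df because the keys are unique;
# the indicator column tells us whether a match was found.
flagged = df.merge(keys, on=["date", "event"], how="left", indicator=True)
df["repeat_D2"] = (flagged["_merge"] == "both").to_numpy()

print(df)
```

With this sample, only the first "a" row is flagged, because "b" recurs three days later rather than two.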