Check if a value is present at another relative date in a table

I have a series of data with dates and events that occurred on those dates (one line per event). I would like to know if a given event has occurred again two days after the first occurrence.

Example table:

date    event
01/01/2000  a
01/01/2000  b
01/01/2000  c
01/01/2000  d
02/01/2000  e
02/01/2000  f
02/01/2000  g
02/01/2000  h
03/01/2000  i
03/01/2000  j
03/01/2000  k
03/01/2000  a
04/01/2000  l
04/01/2000  m
04/01/2000  b
04/01/2000  o

Here, event "a" occurred on 01/01/2000 and recurred on 03/01/2000. I would like to add a TRUE to this line. Here is the table I would like to get:

date    event   repeat_D2
01/01/2000  a   TRUE
01/01/2000  b   FALSE
01/01/2000  c   FALSE
01/01/2000  d   FALSE
02/01/2000  e   FALSE
02/01/2000  f   FALSE
02/01/2000  g   FALSE
02/01/2000  h   FALSE
03/01/2000  i   FALSE
03/01/2000  j   FALSE
03/01/2000  k   FALSE
03/01/2000  a   FALSE
04/01/2000  l   FALSE
04/01/2000  m   FALSE
04/01/2000  b   FALSE
04/01/2000  o   FALSE

The only solution I could think of is a "for" loop with an "is in" check, but that would mean scanning the whole table row by row to see whether the same event appears somewhere with a date equal to D+2. My real table is several hundred megabytes, so I'm not sure this is optimal.

How would you solve this problem?

CodePudding user response:

A good way to handle this type of problem is to join the table to itself on the event column, so that each row is paired with every occurrence of the same event, and then check for rows where your condition is True.

Example

### Setup Start ###
import pandas as pd
events = [
('01/01/2000', 'a'),
('01/01/2000', 'b'),
('01/01/2000', 'c'),
('01/01/2000', 'd'),
('02/01/2000', 'e'),
('02/01/2000', 'f'),
('02/01/2000', 'g'),
('02/01/2000', 'h'),
('03/01/2000', 'i'),
('03/01/2000', 'j'),
('03/01/2000', 'k'),
('03/01/2000', 'a'),
('04/01/2000', 'l'),
('04/01/2000', 'm'),
('04/01/2000', 'b'),
('04/01/2000', 'o')
]

df = pd.DataFrame(events, columns=['date', 'event'])
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
### Setup End ###

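# Self-join df on 'event': each row is paired with every row that has the same event
dfj = df.join(df.set_index('event'), how='left', on='event', lsuffix='_original')
# Flag pairs where the matched occurrence falls exactly two days after the original row
# (vectorized comparison instead of a row-wise apply, which is slow on large tables)
dfj['repeat_D2'] = (dfj['date'] - dfj['date_original']) == pd.Timedelta(days=2)
# Collapse back to one value per original row: True if any pair matched
df['repeat_D2'] = dfj.groupby(level=0)['repeat_D2'].any()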

print(df)

Output

index	date	event	repeat_D2
0	2000-01-01	a	True
1	2000-01-01	b	False
2	2000-01-01	c	False
3	2000-01-01	d	False
4	2000-01-02	e	False
5	2000-01-02	f	False
6	2000-01-02	g	False
7	2000-01-02	h	False
8	2000-01-03	i	False
9	2000-01-03	j	False
10	2000-01-03	k	False
11	2000-01-03	a	False
12	2000-01-04	l	False
13	2000-01-04	m	False
14	2000-01-04	b	False
15	2000-01-04	o	False
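
Note that the self-join creates one row per pair of occurrences of the same event, so it can grow quickly when an event is repeated many times. For a table of several hundred megabytes, a possible alternative is to shift the dates by two days and merge on both columns. The following is only a sketch under that assumption, not a benchmarked solution: the names `keys` and `merged` are made up for illustration, and the sample is a shortened version of the table above.

```python
import pandas as pd

# Small sample in the same shape as the question's table (dates are day-first).
df = pd.DataFrame(
    [("01/01/2000", "a"), ("01/01/2000", "b"), ("02/01/2000", "e"),
     ("03/01/2000", "a"), ("04/01/2000", "b")],
    columns=["date", "event"],
)
df["date"] = pd.to_datetime(df["date"], dayfirst=True)

# Every (date, event) pair that exists, moved two days back: a row matches one
# of these keys exactly when the same event occurs two days after it.
keys = df[["date", "event"]].drop_duplicates()
keys["date"] = keys["date"] - pd.Timedelta(days=2)

# The left merge keeps one output row per input row because the keys are unique;
# indicator=True adds a "_merge" column telling whether a matching key was found.
merged = df.merge(keys, on=["date", "event"], how="left", indicator=True)
df["repeat_D2"] = (merged["_merge"] == "both").to_numpy()

print(df)
```

Here only the first row (event "a" on 2000-01-01) is flagged, since "b" recurs three days later rather than two. Because the keys are deduplicated first, the merged table has the same number of rows as the input.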