Check if a value is present at another relative date in a table

Time:11-03

I have a series of data with dates and events that occurred on those dates (one line per event). I would like to know if a given event has occurred again two days after the first occurrence.

Example of a table :

date    event
01/01/2000  a
01/01/2000  b
01/01/2000  c
01/01/2000  d
02/01/2000  e
02/01/2000  f
02/01/2000  g
02/01/2000  h
03/01/2000  i
03/01/2000  j
03/01/2000  k
03/01/2000  a
04/01/2000  l
04/01/2000  m
04/01/2000  b
04/01/2000  o

Here, event "a" occurred on 01/01/2000 and recurred on 03/01/2000. I would like to add a TRUE to this line. Here is the table I would like to get:

date    event   repeat_D2
01/01/2000  a   TRUE
01/01/2000  b   FALSE
01/01/2000  c   FALSE
01/01/2000  d   FALSE
02/01/2000  e   FALSE
02/01/2000  f   FALSE
02/01/2000  g   FALSE
02/01/2000  h   FALSE
03/01/2000  i   FALSE
03/01/2000  j   FALSE
03/01/2000  k   FALSE
03/01/2000  a   FALSE
04/01/2000  l   FALSE
04/01/2000  m   FALSE
04/01/2000  b   FALSE
04/01/2000  o   FALSE

The only solution I could think of is a "for" loop with an "is in" check, but that would mean scanning the whole table row by row to see whether the same event appears somewhere with a date equal to D+2. My real table is several hundred megabytes, so I'm not sure this is optimal.

How would you solve this problem?

Answer:

A good way to handle this type of problem is to join the table with itself on the event column and then check for rows where your condition is True.

Example

### Setup Start ###
import pandas as pd
events = [
('01/01/2000', 'a'),
('01/01/2000', 'b'),
('01/01/2000', 'c'),
('01/01/2000', 'd'),
('02/01/2000', 'e'),
('02/01/2000', 'f'),
('02/01/2000', 'g'),
('02/01/2000', 'h'),
('03/01/2000', 'i'),
('03/01/2000', 'j'),
('03/01/2000', 'k'),
('03/01/2000', 'a'),
('04/01/2000', 'l'),
('04/01/2000', 'm'),
('04/01/2000', 'b'),
('04/01/2000', 'o')
]

df = pd.DataFrame({'date': [x[0] for x in events], 'event': [x[1] for x in events]}) 
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
### Setup End ###

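# Self-join df on event: each row is paired with every occurrence of the same event
dfj = df.join(df.set_index('event'), how='left', on='event', lsuffix='_original')
# Flag pairs where the other occurrence is exactly two days after the original one
dfj['repeat_D2'] = (dfj['date'] - dfj['date_original']) == pd.Timedelta(days=2)
# Collapse back to one row per original row: True if any pair matched
df['repeat_D2'] = dfj.groupby(level=0)['repeat_D2'].any()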

print(df)

Output

         date event  repeat_D2
0  2000-01-01     a       True
1  2000-01-01     b      False
2  2000-01-01     c      False
3  2000-01-01     d      False
4  2000-01-02     e      False
5  2000-01-02     f      False
6  2000-01-02     g      False
7  2000-01-02     h      False
8  2000-01-03     i      False
9  2000-01-03     j      False
10 2000-01-03     k      False
11 2000-01-03     a      False
12 2000-01-04     l      False
13 2000-01-04     m      False
14 2000-01-04     b      False
15 2000-01-04     o      False
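
Since the real table is several hundred megabytes, a self-join on event alone may produce many intermediate rows when the same event occurs often. One possible alternative, sketched here as an assumption rather than a tested solution for the real data, is to shift the dates back by two days and merge on both date and event. The helper name flag_repeat_after and its days parameter are hypothetical and not part of the question or the answer above.

```python
import pandas as pd

def flag_repeat_after(df: pd.DataFrame, days: int = 2) -> pd.Series:
    """Return True for rows whose event also occurs exactly `days` days later."""
    # One row per distinct (date, event) pair, shifted back by `days`:
    # an occurrence on D+days now lines up with date D.
    later = (
        df[['date', 'event']]
        .drop_duplicates()
        .assign(date=lambda d: d['date'] - pd.Timedelta(days=days))
    )
    # The right-hand keys are unique, so the left merge keeps one row per
    # row of df, in the original order.
    merged = df[['date', 'event']].merge(
        later, on=['date', 'event'], how='left', indicator=True
    )
    # '_merge' == 'both' means the same event exists `days` days later.
    return pd.Series((merged['_merge'] == 'both').to_numpy(), index=df.index)

# Small check on a subset of the sample data
sample = pd.DataFrame({
    'date': pd.to_datetime(
        ['01/01/2000', '01/01/2000', '03/01/2000', '04/01/2000'], dayfirst=True
    ),
    'event': ['a', 'b', 'a', 'b'],
})
sample['repeat_D2'] = flag_repeat_after(sample)
print(sample)
```

With this approach the intermediate table has the same number of rows as the original, whatever the number of repeats per event.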