Let it be the following Python Panda DataFrame:
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| 0 | 2022-04-01 10:00:01 | IN | UK |
| unknown | 2022-04-01 10:00:03 | IN | UK |
| 0 | 2022-04-01 12:00:01 | OUT | UK |
| 0 | 2022-04-01 12:30:11 | IN | GER |
| 1 | 2022-04-01 10:00:00 | IN | GER |
| 1 | 2022-04-01 08:04:03 | OUT | GER |
| unknown | 2022-04-01 10:20:02 | OUT | USA |
| unknown | 2022-04-01 09:59:58 | IN | GER |
| unknown | 2022-04-01 05:04:03 | OUT | ITL |
| unknown | 2022-04-01 05:04:01 | OUT | ITL |
| 2 | 2022-04-01 05:03:59 | OUT | ITL |
I need to create a DataFrame, containing rows with ID value unknown, that have a matching record with direction and country_ID values 2 seconds (it can be changed) apart in time, but the ID of the row it matches is different from unknown.
All rows unknown:
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 10:00:03 | IN | UK |
| unknown | 2022-04-01 10:20:02 | OUT | USA |
| unknown | 2022-04-01 09:59:58 | IN | GER |
| unknown | 2022-04-01 05:04:03 | OUT | ITL |
| unknown | 2022-04-01 05:04:01 | OUT | ITL |
Matching examples for each row specified above:
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 10:00:03 | IN | UK |
| 0 | 2022-04-01 10:00:01 | IN | UK |
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 10:20:02 | OUT | USA |
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 09:59:59 | IN | GER |
| 1 | 2022-04-01 10:00:00 | IN | GER |
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 05:04:03 | OUT | ITL |
| ID | date | direction | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown | 2022-04-01 05:04:01 | OUT | ITL |
| 2 | 2022-04-01 05:03:59 | OUT | ITL |
We remove those that have not any match. We get the resulting DataFrame:
| ID | date | direction | country_ID | date_match | ID_match |
|-----------|-------------------------|---------------|------------|----------------------|---------------|
| unknown | 2022-04-01 10:00:03 | IN | UK | 2022-04-01 10:00:01 | 0 |
| unknown | 2022-04-01 09:59:58 | IN | GER | 2022-04-01 10:00:00 | 1 |
| unknown | 2022-04-01 05:04:01 | OUT | ITL | 2022-04-01 05:03:59 | 2 |
Thank you in advance for your help.
CodePudding user response:
You can use a mask to split the dataframe in two and pandas.merge_asof
to find the matches by group and within 2 seconds:
df['date'] = pd.to_datetime(df['date'])
mask = df['ID'].eq('unknown')
idx = (pd
.merge_asof(df[mask].sort_values(by='date').reset_index(),
df[~mask].sort_values(by='date'),
by=['direction', 'country_ID'],
on='date',
direction='nearest', tolerance=pd.Timedelta('2s'),
)
.loc[lambda d: d['ID_y'].notna(), 'index']
)
df.loc[sorted(idx)]
output:
ID date direction country_ID
1 unknown 2022-04-01 10:00:03 IN UK
7 unknown 2022-04-01 09:59:58 IN GER
9 unknown 2022-04-01 05:04:01 OUT ITL
with merged data
df2 = (pd
.merge_asof(df[mask].sort_values(by='date').reset_index(),
df[~mask].sort_values(by='date').rename(columns={'date': 'date_match'}),
by=['direction', 'country_ID'],
left_on='date', right_on='date_match',
direction='nearest', tolerance=pd.Timedelta('2s'),
suffixes=('', '_match')
)
.loc[lambda d: d['ID_match'].notna()]
.set_index('index').sort_index()
)
output:
ID date direction country_ID ID_match date_match
index
1 unknown 2022-04-01 10:00:03 IN UK 0 2022-04-01 10:00:01
7 unknown 2022-04-01 09:59:58 IN GER 1 2022-04-01 10:00:00
9 unknown 2022-04-01 05:04:01 OUT ITL 2 2022-04-01 05:03:59