Get records that are a time interval away from a given date and specific conditions on a pandas Data-CodePudding

Let it be the following Python Panda DataFrame:

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
|   0       |    2022-04-01 10:00:01  |    IN         |    UK      |
| unknown   |    2022-04-01 10:00:03  |    IN         |    UK      |
|   0       |    2022-04-01 12:00:01  |    OUT        |    UK      |
|   0       |    2022-04-01 12:30:11  |    IN         |    GER     |
|   1       |    2022-04-01 10:00:00  |    IN         |    GER     |
|   1       |    2022-04-01 08:04:03  |    OUT        |    GER     |
| unknown   |    2022-04-01 10:20:02  |    OUT        |    USA     |
| unknown   |    2022-04-01 09:59:58  |    IN         |    GER     |
| unknown   |    2022-04-01 05:04:03  |    OUT        |    ITL     |
| unknown   |    2022-04-01 05:04:01  |    OUT        |    ITL     |
|   2       |    2022-04-01 05:03:59  |    OUT        |    ITL     |

I need to create a DataFrame, containing rows with ID value unknown, that have a matching record with direction and country_ID values 2 seconds (it can be changed) apart in time, but the ID of the row it matches is different from unknown.

All rows unknown:

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 10:00:03  |    IN         |    UK      |
| unknown   |    2022-04-01 10:20:02  |    OUT        |    USA     |
| unknown   |    2022-04-01 09:59:58  |    IN         |    GER     |
| unknown   |    2022-04-01 05:04:03  |    OUT        |    ITL     |
| unknown   |    2022-04-01 05:04:01  |    OUT        |    ITL     |

Matching examples for each row specified above:

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 10:00:03  |    IN         |    UK      |
|   0       |    2022-04-01 10:00:01  |    IN         |    UK      |

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 10:20:02  |    OUT        |    USA     |

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 09:59:59  |    IN         |    GER     |
|   1       |    2022-04-01 10:00:00  |    IN         |    GER     |

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 05:04:03  |    OUT        |    ITL     |

|   ID      |    date                 |  direction    | country_ID |
|-----------|-------------------------|---------------|------------|
| unknown   |    2022-04-01 05:04:01  |    OUT        |    ITL     |
| 2         |    2022-04-01 05:03:59  |    OUT        |    ITL     |

We remove those that have not any match. We get the resulting DataFrame:

|   ID      |    date                 |  direction    | country_ID |  date_match          |   ID_match    |
|-----------|-------------------------|---------------|------------|----------------------|---------------|
| unknown   |    2022-04-01 10:00:03  |    IN         |    UK      |  2022-04-01 10:00:01 |    0          |
| unknown   |    2022-04-01 09:59:58  |    IN         |    GER     |  2022-04-01 10:00:00 |    1          |
| unknown   |    2022-04-01 05:04:01  |    OUT        |    ITL     |  2022-04-01 05:03:59 |    2          |

Thank you in advance for your help.

CodePudding user response：

You can use a mask to split the dataframe in two and pandas.merge_asof to find the matches by group and within 2 seconds:

df['date'] = pd.to_datetime(df['date'])

mask = df['ID'].eq('unknown')

idx = (pd
 .merge_asof(df[mask].sort_values(by='date').reset_index(),
             df[~mask].sort_values(by='date'),
             by=['direction', 'country_ID'],
             on='date',
             direction='nearest', tolerance=pd.Timedelta('2s'),
             )
 .loc[lambda d: d['ID_y'].notna(), 'index']
)

df.loc[sorted(idx)]

output:

        ID                date direction country_ID
1  unknown 2022-04-01 10:00:03        IN         UK
7  unknown 2022-04-01 09:59:58        IN        GER
9  unknown 2022-04-01 05:04:01       OUT        ITL

with merged data

df2 = (pd
 .merge_asof(df[mask].sort_values(by='date').reset_index(),
             df[~mask].sort_values(by='date').rename(columns={'date': 'date_match'}),
             by=['direction', 'country_ID'],
             left_on='date', right_on='date_match',
             direction='nearest', tolerance=pd.Timedelta('2s'),
             suffixes=('', '_match')
             )
 .loc[lambda d: d['ID_match'].notna()]
 .set_index('index').sort_index()

)

output:

            ID                date direction country_ID ID_match          date_match
index                                                                               
1      unknown 2022-04-01 10:00:03        IN         UK        0 2022-04-01 10:00:01
7      unknown 2022-04-01 09:59:58        IN        GER        1 2022-04-01 10:00:00
9      unknown 2022-04-01 05:04:01       OUT        ITL        2 2022-04-01 05:03:59