How to assign identical random IDs conditionally to "related" rows in pandas?-CodePudding

New to Python I'm struggling with the problem to assign some random IDs to "related" rows where the relation is simply their proximity (within 14 days) in consecutive days grouped by user. In that example I chose uuidwithout any specific intention. It could be any other random IDs uniquely indentifying conceptually related rows.

        import pandas as pd
        import uuid
        import numpy as np

Here is a dummy dataframe:

        dummy_df = pd.DataFrame({"transactionid": [1, 2, 3, 4, 5, 6, 7, 8],
                                "user": ["michael", 
                                                    "michael", 
                                                    "michael", 
                                                    "tom", 
                                                    "tom", 
                                                    "tom",
                                                    "tom", 
                                                    "tom"],
                                "transactiontime": pd.to_datetime(["2022-01-01", 
                                                                "2022-01-02", 
                                                                "2022-01-03", 
                                                                "2022-09-01", 
                                                                "2022-09-13",
                                                                "2022-10-17",
                                                                "2022-10-20",
                                                                "2022-11-17"])})
        dummy_df.head(10)
        
        transactionid   user    transactiontime
                0   1   michael 2022-01-01
                1   2   michael 2022-01-02
                2   3   michael 2022-01-03
                3   4   tom     2022-09-01
                4   5   tom     2022-09-13
                5   6   tom     2022-10-17
                6   7   tom     2022-10-20
                7   8   tom     2022-11-17

Here I sort transactions and calculate their difference in days:

        dummy_df = dummy_df.assign(
            timediff = dummy_df
            .sort_values('transactiontime')
            .groupby(["user"])['transactiontime'].diff() / np.timedelta64(1, 'D')
            ).fillna(0)
        
        dummy_df.head(10)
        
        
        transactionid   user    transactiontime timediff
                0   1   michael 2022-01-01  0.0
                1   2   michael 2022-01-02  1.0
                2   3   michael 2022-01-03  1.0
                3   4   tom     2022-09-01  0.0
                4   5   tom     2022-09-13  12.0
                5   6   tom     2022-10-17  34.0
                6   7   tom     2022-10-20  3.0
                7   8   tom     2022-11-17  28.0

Here I create a new column with a random IDs for each related transaction - though it does not work as expected:

        dummy_df.assign(related_transaction = np.where((dummy_df.timediff >= 0) & (dummy_df.timediff < 15), uuid.uuid4(), dummy_df.transactionid))
        
        
        transactionid   user    transactiontime timediff    related_transaction
                0   1   michael 2022-01-01  0.0  fd630f07-6564-4773-aff9-44ecb1e4211d
                1   2   michael 2022-01-02  1.0  fd630f07-6564-4773-aff9-44ecb1e4211d
                2   3   michael 2022-01-03  1.0  fd630f07-6564-4773-aff9-44ecb1e4211d
                3   4   tom     2022-09-01  0.0  fd630f07-6564-4773-aff9-44ecb1e4211d
                4   5   tom     2022-09-13  12.0 fd630f07-6564-4773-aff9-44ecb1e4211d
                5   6   tom     2022-10-17  34.0    6
                6   7   tom     2022-10-20  3.0  fd630f07-6564-4773-aff9-44ecb1e4211d
                7   8   tom     2022-11-17  28.0    8

What I would expect is something like given that the user group difference between transactions is within 14 days:

        transactionid   user    transactiontime timediff    related_transaction
                0   1   michael 2022-01-01  0.0  ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
                1   2   michael 2022-01-02  1.0  ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
                2   3   michael 2022-01-03  1.0  ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
                3   4   tom     2022-09-01  0.0  b1da2251-7770-4756-8863-c82f90657542
                4   5   tom     2022-09-13  12.0 b1da2251-7770-4756-8863-c82f90657542
                5   6   tom     2022-10-17  34.0 485a8d97-80d1-4184-8fc8-99523f471527
                6   7   tom     2022-10-20  3.0  485a8d97-80d1-4184-8fc8-99523f471527
                7   8   tom     2022-11-17  28.0    8

CodePudding user response：

The mismatch between your code and your desired result is that uuid.uuid4() creates an ID a single time and assigns it to all the relevant rows defined by np.where(). Instead, you need to generate the IDs in a vectorized way.

Try the following approach:

df.loc[ROW_CONDITIONs, COLUMNS] = VECTORIZED_ID_GENERATOR

which for your example would be

dummy_df.loc[(dummy_df['timediff'] >= 0) & (dummy_df['timediff'] < 15), 'related_transaction'] = dummy_df.apply(lambda _: uuid.uuid4(), axis=1)

Take into account that this only solves your question of how to assign random IDs using uuid conditionally in Pandas. It looks to me that you also need to generate the same ID for the same user and for transactions every 15 days. My advice for that would be to generate a dataframe where every row is a combination of two transactions and add a condition saying that the users from both transactions need to be the same.

CodePudding user response：

Taking the idea from Luise, we start with an empty column for related_transaction. Then, we iterate through each row. For each date, we check if it is already part of a transaction. If so, continue. Otherwise, assign a new transaction to that date and all other dates within 15 following days for the same user:

import datetime
df = dummy_df
df['related_transaction'] = None
for i, row in dummy_df.iterrows():
    if df.loc[i].related_transaction is not None:
        # We already assigned that row
        continue
    df.loc[  # Select where:
        (df.transactiontime <= row.transactiontime   datetime.timedelta(days=15)) & #  Current row   15 days
        (df.user == row.user) &  # Same user
        (pd.isna(df.related_transaction)),  # Don't overwrite anything already assigned
        'related_transaction'  #  Set this column to:
    ] = uuid.uuid4()  # Assign new UUID

This gives the output:


    transactionid   user    transactiontime related_transaction
0               1   michael 2022-01-01      82d28e10-149b-481e-ba41-f5833662ba99
1               2   michael 2022-01-02      82d28e10-149b-481e-ba41-f5833662ba99
2               3   michael 2022-01-03      82d28e10-149b-481e-ba41-f5833662ba99
3               4   tom     2022-09-01      fa253663-8615-419a-afda-7646906024f0
4               5   tom     2022-09-13      fa253663-8615-419a-afda-7646906024f0
5               6   tom     2022-10-17      d6152d4b-1560-40e0-8589-bd8e3da363db
6               7   tom     2022-10-20      d6152d4b-1560-40e0-8589-bd8e3da363db
7               8   tom     2022-11-17      2a93d78d-b6f6-4f0f-bb09-1bc18361aa21

In your example, the dates are already sorted, that's an important assumption I'm making here!