New to Python I'm struggling with the problem to assign some random IDs to "related" rows
where the relation is simply their proximity (within 14 days) in consecutive days grouped by user. In that example I chose uuid
without any specific intention. It could be any other random IDs uniquely indentifying conceptually related rows.
import pandas as pd
import uuid
import numpy as np
Here is a dummy dataframe:
dummy_df = pd.DataFrame({"transactionid": [1, 2, 3, 4, 5, 6, 7, 8],
"user": ["michael",
"michael",
"michael",
"tom",
"tom",
"tom",
"tom",
"tom"],
"transactiontime": pd.to_datetime(["2022-01-01",
"2022-01-02",
"2022-01-03",
"2022-09-01",
"2022-09-13",
"2022-10-17",
"2022-10-20",
"2022-11-17"])})
dummy_df.head(10)
transactionid user transactiontime
0 1 michael 2022-01-01
1 2 michael 2022-01-02
2 3 michael 2022-01-03
3 4 tom 2022-09-01
4 5 tom 2022-09-13
5 6 tom 2022-10-17
6 7 tom 2022-10-20
7 8 tom 2022-11-17
Here I sort transactions and calculate their difference in days:
dummy_df = dummy_df.assign(
timediff = dummy_df
.sort_values('transactiontime')
.groupby(["user"])['transactiontime'].diff() / np.timedelta64(1, 'D')
).fillna(0)
dummy_df.head(10)
transactionid user transactiontime timediff
0 1 michael 2022-01-01 0.0
1 2 michael 2022-01-02 1.0
2 3 michael 2022-01-03 1.0
3 4 tom 2022-09-01 0.0
4 5 tom 2022-09-13 12.0
5 6 tom 2022-10-17 34.0
6 7 tom 2022-10-20 3.0
7 8 tom 2022-11-17 28.0
Here I create a new column with a random IDs for each related transaction - though it does not work as expected:
dummy_df.assign(related_transaction = np.where((dummy_df.timediff >= 0) & (dummy_df.timediff < 15), uuid.uuid4(), dummy_df.transactionid))
transactionid user transactiontime timediff related_transaction
0 1 michael 2022-01-01 0.0 fd630f07-6564-4773-aff9-44ecb1e4211d
1 2 michael 2022-01-02 1.0 fd630f07-6564-4773-aff9-44ecb1e4211d
2 3 michael 2022-01-03 1.0 fd630f07-6564-4773-aff9-44ecb1e4211d
3 4 tom 2022-09-01 0.0 fd630f07-6564-4773-aff9-44ecb1e4211d
4 5 tom 2022-09-13 12.0 fd630f07-6564-4773-aff9-44ecb1e4211d
5 6 tom 2022-10-17 34.0 6
6 7 tom 2022-10-20 3.0 fd630f07-6564-4773-aff9-44ecb1e4211d
7 8 tom 2022-11-17 28.0 8
What I would expect is something like given that the user group difference between transactions is within 14 days:
transactionid user transactiontime timediff related_transaction
0 1 michael 2022-01-01 0.0 ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
1 2 michael 2022-01-02 1.0 ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
2 3 michael 2022-01-03 1.0 ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
3 4 tom 2022-09-01 0.0 b1da2251-7770-4756-8863-c82f90657542
4 5 tom 2022-09-13 12.0 b1da2251-7770-4756-8863-c82f90657542
5 6 tom 2022-10-17 34.0 485a8d97-80d1-4184-8fc8-99523f471527
6 7 tom 2022-10-20 3.0 485a8d97-80d1-4184-8fc8-99523f471527
7 8 tom 2022-11-17 28.0 8
CodePudding user response:
The mismatch between your code and your desired result is that uuid.uuid4()
creates an ID a single time and assigns it to all the relevant rows defined by np.where()
. Instead, you need to generate the IDs in a vectorized way.
Try the following approach:
df.loc[ROW_CONDITIONs, COLUMNS] = VECTORIZED_ID_GENERATOR
which for your example would be
dummy_df.loc[(dummy_df['timediff'] >= 0) & (dummy_df['timediff'] < 15), 'related_transaction'] = dummy_df.apply(lambda _: uuid.uuid4(), axis=1)
Take into account that this only solves your question of how to assign random IDs using uuid
conditionally in Pandas. It looks to me that you also need to generate the same ID for the same user and for transactions every 15 days. My advice for that would be to generate a dataframe where every row is a combination of two transactions and add a condition saying that the users from both transactions need to be the same.
CodePudding user response:
Taking the idea from Luise, we start with an empty column for related_transaction
. Then, we iterate through each row. For each date, we check if it is already part of a transaction. If so, continue. Otherwise, assign a new transaction to that date and all other dates within 15 following days for the same user:
import datetime
df = dummy_df
df['related_transaction'] = None
for i, row in dummy_df.iterrows():
if df.loc[i].related_transaction is not None:
# We already assigned that row
continue
df.loc[ # Select where:
(df.transactiontime <= row.transactiontime datetime.timedelta(days=15)) & # Current row 15 days
(df.user == row.user) & # Same user
(pd.isna(df.related_transaction)), # Don't overwrite anything already assigned
'related_transaction' # Set this column to:
] = uuid.uuid4() # Assign new UUID
This gives the output:
transactionid user transactiontime related_transaction
0 1 michael 2022-01-01 82d28e10-149b-481e-ba41-f5833662ba99
1 2 michael 2022-01-02 82d28e10-149b-481e-ba41-f5833662ba99
2 3 michael 2022-01-03 82d28e10-149b-481e-ba41-f5833662ba99
3 4 tom 2022-09-01 fa253663-8615-419a-afda-7646906024f0
4 5 tom 2022-09-13 fa253663-8615-419a-afda-7646906024f0
5 6 tom 2022-10-17 d6152d4b-1560-40e0-8589-bd8e3da363db
6 7 tom 2022-10-20 d6152d4b-1560-40e0-8589-bd8e3da363db
7 8 tom 2022-11-17 2a93d78d-b6f6-4f0f-bb09-1bc18361aa21
In your example, the dates are already sorted, that's an important assumption I'm making here!