I have a file with columns A, B, C.
Next week I will get a new file with Columns A, B, C.
A and B are a unique combination. I want to generate a unique id based on these 2 fields.
When I get the new file, I want to apply the same thing to it, so if combination A, B is also in next week’s file, it will have the same ID as the previous week.
CodePudding user response:
One way is to maintain a dataset with unique ids which you have already created, e.g., in a pickle file. Let's call this tracker
. Whenever you find new unique combinations for columns A and B, you add them to the dataset. To create ids, you can use pd.factorize
. Be aware to ensure that new ids have to be different from the ones that are already created and included in the tracker.
This is a minimum working example:
import pandas as pd
import numpy as np
def factorize_data(df, tracker):
# merge already existing identifers
df_factorized = df.merge(
tracker,
on=['A','B'],
how='left'
)
# get rows which do not yet have a unique id
untracked = df_factorized.loc[df_factorized['id'].isnull(), ['A','B']]
if not untracked.empty:
# create new id
untracked['id'] = pd.factorize(pd._libs.lib.fast_zip([untracked.A.values, untracked.B.values]))[0].astype(int)
# ensure that new ids are not confused with old ids
max_id = 0 if tracker.empty else tracker['id'].max()
untracked['id'] = max_id 1
# update tracker
tracker_updated = pd.concat([tracker, untracked])
# update dataframe with new ids
df_factorized = df_factorized.merge(
untracked,
on=['A','B'],
how='left',
suffixes=('_old', '_new')
)
# combine old and new ids
df_factorized['id'] = np.where(df_factorized['id_old'].isnull(), df_factorized['id_new'], df_factorized['id_old'])
df_factorized['id'] = df_factorized['id'].astype(int)
df_factorized.drop(columns=['id_new','id_old'], inplace=True)
return df_factorized, tracker_updated
return df_factorized, tracker
# initialize tracking dataframe
tracker = pd.DataFrame(columns=['A','B','id'])
# data today
df_today = pd.DataFrame({
'A': [1,2,3],
'B': [4,5,6],
'C': [7,8,9]
})
# data tomorrow
df_tomorrow = pd.DataFrame({
'A': [1,4,5],
'B': [4,10,11],
'C': [7,12,13]
})
# create new column with unique ids and update tracker if necessary
df_today_factorized, tracker_updated = factorize_data(df_today, tracker)
# overwrite tracker
tracker = tracker_updated
# create new column with unique ids and update tracker if necessary
df_tomorrow_factorized, tracker_updated = factorize_data(df_tomorrow, tracker)
It transform df_today
from
A B C
0 1 4 7
1 2 5 8
2 3 6 9
to
A B C id
0 1 4 7 1
1 2 5 8 2
2 3 6 9 3
and updates the initialized, empty tracker to the following dataset
A B id
0 1 4 1
1 2 5 2
2 3 6 3
Applying the procedure to tomorrow's dataframe, df_tomorrow
,
A B C
0 1 4 7
1 4 10 12
2 5 11 13
transforms it to
A B C id
0 1 4 7 1
1 4 10 12 4
2 5 11 13 5
and update the tracker to
A B id
0 1 4 1
1 2 5 2
2 3 6 3
1 4 10 4
2 5 11 5
Note that the id of the first row in df_tomorrow
relies on the id found in df_today
while the other ones are created and added to the tracker.