Given a data frame with 9,000 records (rows), a column with unique IDs, and each ID can have multiple records as shown below.
df.head(4)
Unique_ID | Record_1 | Record_2 |
---|---|---|
AN5001 | 90.0 | ten |
AN5002 | 90.0 | five |
AN5001 | 95.0 | five |
AN5003 | 60.0 | forty |
There are 360 unique IDs. However, about half of them need to be corrected. Consider below df_corrected_ID.head(3)
Unique_ID_old | Unique_ID_new |
---|---|
AN5001 | AN5010 |
AN5002 | AN5002 |
AN5003 | AN5011 |
How would you, most efficiently, fix the Unique ID in the main df
with 9,000 records using the df_corrected_ID
data frame?
So, check if Unique_ID_old is present in df['Unique_ID']
column and if it is, replace it with Unique_ID_new from df_corrected_ID
.
How would one then check change occurred correctly (for example, just show the difference between the Unique_IDs -- say after converting the original and updated columns to lists and then, list(set(Unique_ID) - set(Unique_ID_new))
.
It's okay to add another new column to original df
if needed with corrected IDs, as long as the order is maintained an none of the records are changed.
Thanks!
CodePudding user response:
It is simple mapping. You can add a column or update in place.
df_map = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5003'], uni_update=['AN5010','AN5002','AN5011']))
df_record = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5001','AN5003'], rec1=[1,2,3,4], rec2=['a','b','c','d']))
df_record['uni_update'] = df_record.uni.map(dict(zip(df_map.uni, df_map.uni_update)))
df w/ new col
| | uni | rec1 | rec2 | uni_update |
|---:|:-------|-------:|:-------|:-------------|
| 0 | AN5001 | 1 | a | AN5010 |
| 1 | AN5002 | 2 | b | AN5002 |
| 2 | AN5001 | 3 | c | AN5010 |
| 3 | AN5003 | 4 | d | AN5011 |
to check afterwards
df_record_reduced = df_record.drop_duplicates(subset='uni').loc[:,['uni','uni_update']].reset_index(drop=True)
| | uni | uni_update |
|---:|:-------|:-------------|
| 0 | AN5001 | AN5010 |
| 1 | AN5002 | AN5002 |
| 2 | AN5003 | AN5011 |
df_map.equals(df_record_reduced) #should be true if everything is aligned
CodePudding user response:
You can use Series.replace for this. Make your corrected ID DataFrame into a dictionary that maps old IDs to new ones, and you can use it to update your existing column or create a new one.
update_dict = dict(zip(df_corrected_ID['Unique_ID_old'], df_corrected_ID['Unique_ID_new'])
# to create a new column
df['Unique_ID_updated'] = df['Unique_ID'].replace(update_dict)
# to update the existing column
df['Unique_ID'] = df['Unique_ID'].replace(update_dict)