Modify pandas data frame values based on condition-CodePudding

Given a data frame with 9,000 records (rows), a column with unique IDs, and each ID can have multiple records as shown below.

df.head(4)

Unique_ID	Record_1	Record_2
AN5001	90.0	ten
AN5002	90.0	five
AN5001	95.0	five
AN5003	60.0	forty

There are 360 unique IDs. However, about half of them need to be corrected. Consider below df_corrected_ID.head(3)

Unique_ID_old	Unique_ID_new
AN5001	AN5010
AN5002	AN5002
AN5003	AN5011

How would you, most efficiently, fix the Unique ID in the main df with 9,000 records using the df_corrected_ID data frame?

So, check if Unique_ID_old is present in df['Unique_ID'] column and if it is, replace it with Unique_ID_new from df_corrected_ID.

How would one then check change occurred correctly (for example, just show the difference between the Unique_IDs -- say after converting the original and updated columns to lists and then, list(set(Unique_ID) - set(Unique_ID_new)).

It's okay to add another new column to original df if needed with corrected IDs, as long as the order is maintained an none of the records are changed.

Thanks!

CodePudding user response：

It is simple mapping. You can add a column or update in place.

df_map = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5003'], uni_update=['AN5010','AN5002','AN5011'])) 
df_record = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5001','AN5003'], rec1=[1,2,3,4], rec2=['a','b','c','d']))
    
df_record['uni_update'] = df_record.uni.map(dict(zip(df_map.uni, df_map.uni_update)))

df w/ new col

|    | uni    |   rec1 | rec2   | uni_update   |
|---:|:-------|-------:|:-------|:-------------|
|  0 | AN5001 |      1 | a      | AN5010       |
|  1 | AN5002 |      2 | b      | AN5002       |
|  2 | AN5001 |      3 | c      | AN5010       |
|  3 | AN5003 |      4 | d      | AN5011       |

to check afterwards

df_record_reduced = df_record.drop_duplicates(subset='uni').loc[:,['uni','uni_update']].reset_index(drop=True)

|    | uni    | uni_update   |
|---:|:-------|:-------------|
|  0 | AN5001 | AN5010       |
|  1 | AN5002 | AN5002       |
|  2 | AN5003 | AN5011       |

df_map.equals(df_record_reduced) #should be true if everything is aligned

CodePudding user response：

You can use Series.replace for this. Make your corrected ID DataFrame into a dictionary that maps old IDs to new ones, and you can use it to update your existing column or create a new one.

update_dict = dict(zip(df_corrected_ID['Unique_ID_old'], df_corrected_ID['Unique_ID_new'])

# to create a new column
df['Unique_ID_updated'] = df['Unique_ID'].replace(update_dict)

# to update the existing column
df['Unique_ID'] = df['Unique_ID'].replace(update_dict)