Home > Mobile >  Modify pandas data frame values based on condition
Modify pandas data frame values based on condition

Time:12-10

Given a data frame with 9,000 records (rows), a column with unique IDs, and each ID can have multiple records as shown below.

df.head(4)

Unique_ID Record_1 Record_2
AN5001 90.0 ten
AN5002 90.0 five
AN5001 95.0 five
AN5003 60.0 forty

There are 360 unique IDs. However, about half of them need to be corrected. Consider below df_corrected_ID.head(3)

Unique_ID_old Unique_ID_new
AN5001 AN5010
AN5002 AN5002
AN5003 AN5011

How would you, most efficiently, fix the Unique ID in the main df with 9,000 records using the df_corrected_ID data frame?

So, check if Unique_ID_old is present in df['Unique_ID'] column and if it is, replace it with Unique_ID_new from df_corrected_ID.

How would one then check change occurred correctly (for example, just show the difference between the Unique_IDs -- say after converting the original and updated columns to lists and then, list(set(Unique_ID) - set(Unique_ID_new)).

It's okay to add another new column to original df if needed with corrected IDs, as long as the order is maintained an none of the records are changed.

Thanks!

CodePudding user response:

It is simple mapping. You can add a column or update in place.

df_map = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5003'], uni_update=['AN5010','AN5002','AN5011'])) 
df_record = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5001','AN5003'], rec1=[1,2,3,4], rec2=['a','b','c','d']))
    
df_record['uni_update'] = df_record.uni.map(dict(zip(df_map.uni, df_map.uni_update)))

df w/ new col

|    | uni    |   rec1 | rec2   | uni_update   |
|---:|:-------|-------:|:-------|:-------------|
|  0 | AN5001 |      1 | a      | AN5010       |
|  1 | AN5002 |      2 | b      | AN5002       |
|  2 | AN5001 |      3 | c      | AN5010       |
|  3 | AN5003 |      4 | d      | AN5011       |

to check afterwards

df_record_reduced = df_record.drop_duplicates(subset='uni').loc[:,['uni','uni_update']].reset_index(drop=True)

|    | uni    | uni_update   |
|---:|:-------|:-------------|
|  0 | AN5001 | AN5010       |
|  1 | AN5002 | AN5002       |
|  2 | AN5003 | AN5011       |

df_map.equals(df_record_reduced) #should be true if everything is aligned

CodePudding user response:

You can use Series.replace for this. Make your corrected ID DataFrame into a dictionary that maps old IDs to new ones, and you can use it to update your existing column or create a new one.

update_dict = dict(zip(df_corrected_ID['Unique_ID_old'], df_corrected_ID['Unique_ID_new'])

# to create a new column
df['Unique_ID_updated'] = df['Unique_ID'].replace(update_dict)

# to update the existing column
df['Unique_ID'] = df['Unique_ID'].replace(update_dict)
  • Related