I have two dataframes like this:
codename =
id code region
1 AAA Alpha
2 BBB Beta
3 CCC Gamma
4 DDD Delta
... ... ...
list =
id region code
1 BBB
2 DDD1
3 AAA
4 CCC10
5 AAA2
... ...
I want to fill the region column in the second dataframe by the code in the first dataframe. How do I compare these two code columns because in the second dataframe, the code have number but still represent the same region as the first three letter code.
Both of my datasets are quite big so is there any way to insert the value fastest. Thank you in advance!
CodePudding user response:
What you want to do is called a join - i.e., fill in values in one table from another table according to agreement on a key column. pandas knows how to do this (doc)
First, you need to clean up the column you're joining on:
# create a new column with the first 3 letters of values in the 'code' column
list['code_clean'] = list['code'].str.slice(0, 2) # keep first 3 letters
# drop the empty column from the list df so there's no overlap in the target column
list.drop('region', axis=1, inplace=True)
Now we can join on the key column (in your case it's the 'code' column). pandas requires that the column be the index for the 'other' dataframe:
list = list.join(codename.set_index('code'), on='code_clean')
list
out:
id region code code_clean
1 Beta BBB BBB
2 Delta DDD1 DDD
3 Alpha AAA AAA
4 Gamma CCC10 CCC
5 Alpha AAA2 AAA
Also, never use a python built-in name for a variable name (the "list" dataframe). It can and will lead to unexpected behavior.