Python - How to compare two columns with mixed character strings but still represent the same value?-CodePudding

I have two dataframes like this:

codename = 

id       code       region
1        AAA        Alpha
2        BBB        Beta
3        CCC        Gamma
4        DDD        Delta
...      ...        ...

list = 

id       region     code
1                   BBB
2                   DDD1
3                   AAA
4                   CCC10
5                   AAA2
...                 ...

I want to fill the region column in the second dataframe by the code in the first dataframe. How do I compare these two code columns because in the second dataframe, the code have number but still represent the same region as the first three letter code.

Both of my datasets are quite big so is there any way to insert the value fastest. Thank you in advance!

CodePudding user response：

What you want to do is called a join - i.e., fill in values in one table from another table according to agreement on a key column. pandas knows how to do this (doc)

First, you need to clean up the column you're joining on:

# create a new column with the first 3 letters of values in the 'code' column
list['code_clean'] = list['code'].str.slice(0, 2)  # keep first 3 letters
# drop the empty column from the list df so there's no overlap in the target column
list.drop('region', axis=1, inplace=True)

Now we can join on the key column (in your case it's the 'code' column). pandas requires that the column be the index for the 'other' dataframe:

list = list.join(codename.set_index('code'), on='code_clean')
list

out:

id       region     code     code_clean
1        Beta       BBB      BBB
2        Delta      DDD1     DDD
3        Alpha      AAA      AAA
4        Gamma      CCC10    CCC
5        Alpha      AAA2     AAA

Also, never use a python built-in name for a variable name (the "list" dataframe). It can and will lead to unexpected behavior.