How to merge two dataframes when a column's values have different variations?

Time:11-24

I have two data frames that I want to merge on a column with the same name, but the same value can appear in different variations.

Example variations of a single value:

Variations
USA
US
United States
United States of America
The United States of America

Suppose the data frames are as follows.

df1 =

country        column B
India          Cell 2
China          Cell 4
United States  Cell 2
UK             Cell 4

df2 =

Country  clm
USA      val1
CH       val2
IN       val3

How do I merge so that "United States" in df1 is matched with "USA" in df2?

I have tried DataFrame.merge, but it only matches rows whose values in the key column are exactly equal.

Is there a way to match the variations and merge the dataframes?

CodePudding user response:

You can create a reference table (reftable) that maps every variation to one canonical name, then merge against it.

Your data:

import pandas as pd

df = pd.DataFrame({'name':['USA', 'US', 'United States', 'FR', 'France'],
                   'val':[1,2,3,4,5]})
df

            name  val
0            USA    1
1             US    2
2  United States    3
3             FR    4
4         France    5

Your reftable:

reftable = pd.DataFrame({'name':['United States', 'US', 'USA', 'United States of America', 'The United States of America', 'France', 'FR', 'Frank'],
                         'uniqname':['us']*5 + ['fr']*3})
reftable
                           name uniqname
0                 United States       us
1                            US       us
2                           USA       us
3      United States of America       us
4  The United States of America       us
5                        France       fr
6                            FR       fr
7                         Frank       fr

Now merge:

new = pd.merge(df, reftable, on='name', how='left')
new

            name  val uniqname
0            USA    1       us
1             US    2       us
2  United States    3       us
3             FR    4       fr
4         France    5       fr
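
To apply the same idea to the two frames from the question, one option is to map both key columns to the canonical name and merge on that. This is only a sketch: the reftable entries below are made up for illustration, and it assumes "CH" and "IN" in df2 stand for China and India, which the question does not confirm. The column name "key" is also just a placeholder.

```python
import pandas as pd

# Frames from the question
df1 = pd.DataFrame({"country": ["India", "China", "United States", "UK"],
                    "column B": ["Cell 2", "Cell 4", "Cell 2", "Cell 4"]})
df2 = pd.DataFrame({"Country": ["USA", "CH", "IN"],
                    "clm": ["val1", "val2", "val3"]})

# Hypothetical reference table: each known spelling -> one canonical key.
# Assumption: "CH" means China and "IN" means India.
reftable = pd.DataFrame({
    "name": ["USA", "US", "United States", "United States of America",
             "The United States of America", "India", "IN", "China", "CH", "UK"],
    "uniqname": ["us"] * 5 + ["in"] * 2 + ["cn"] * 2 + ["uk"],
})

# Series lookup: spelling (index) -> canonical key (value)
lookup = reftable.set_index("name")["uniqname"]

# Add the canonical key to both frames; unknown spellings become NaN
df1["key"] = df1["country"].map(lookup)
df2["key"] = df2["Country"].map(lookup)

# Left merge keeps every row of df1, matched or not
merged = df1.merge(df2, on="key", how="left")
print(merged)
```

With how='left', the UK row stays in the result with missing values for the df2 columns, because df2 has no matching entry.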

CodePudding user response:

Use .count to count how many times "United States" appears in the list, then use an if statement to check whether it appears in the list at all. Do the same for each of the other variations, and add a final if statement that checks whether any of them is in the list, so you can output the value that you want.
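
In pandas, a membership check like this can probably be done without counting each variation by hand. The sketch below swaps .count for Series.isin and Series.mask, which is not what the answer literally describes. The names us_variations and country_norm are made up for the example, and "USA" is an arbitrary choice of canonical spelling.

```python
import pandas as pd

# Variations listed in the question
us_variations = ["USA", "US", "United States",
                 "United States of America", "The United States of America"]

df1 = pd.DataFrame({"country": ["India", "China", "United States", "UK"],
                    "column B": ["Cell 2", "Cell 4", "Cell 2", "Cell 4"]})

# True for every row whose country is one of the known variations
is_us = df1["country"].isin(us_variations)

# Replace the matching rows with one canonical spelling, keep the rest as-is
df1["country_norm"] = df1["country"].mask(is_us, "USA")

# Equivalent of the final "is any of them in the list" check
print(is_us.any())
print(df1)
```

This only normalizes one group of variations; with many countries, a reference table is likely easier to maintain.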
