I have two data frames that I want to merge on a column with the same name, but the values in that column can be different variations of the same value.
For example, variations of one value:
Variations |
---|
USA |
US |
United States |
United States of America |
The United States of America |
And let's suppose the data frames are as below:
df1 =
country | column B |
---|---|
India | Cell 2 |
China | Cell 4 |
United States | Cell 2 |
UK | Cell 4 |
df2 =
Country | clm |
---|---|
USA | val1 |
CH | val2 |
IN | val3 |
Now how do I merge so that United States is matched with USA?
I have tried DataFrame.merge, but it only joins rows whose values in that column match exactly.
Is there a way to match the variations and merge the dataframes?
CodePudding user response:
Simply create a reference table (reftable) that maps every variation to one unique name, then merge your data with it.
Your data:
import pandas as pd

df = pd.DataFrame({'name': ['USA', 'US', 'United States', 'FR', 'France'],
                   'val': [1, 2, 3, 4, 5]})
df
name val
0 USA 1
1 US 2
2 United States 3
3 FR 4
4 France 5
Your reftable:
# Each variation in 'name' maps to one canonical value in 'uniqname'.
reftable = pd.DataFrame({'name': ['United States', 'US', 'USA', 'United States of America', 'The United States of America', 'France', 'FR', 'Frank'],
                         'uniqname': ['us']*5 + ['fr']*3})
reftable
name uniqname
0 United States us
1 US us
2 USA us
3 United States of America us
4 The United States of America us
5 France fr
6 FR fr
7 Frank fr
Now merge:
new = pd.merge(df, reftable, on='name', how='left')
new
name val uniqname
0 USA 1 us
1 US 2 us
2 United States 3 us
3 FR 4 fr
4 France 5 fr
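Applied to the two data frames from the question, the same idea could look like the sketch below. The `variations` dictionary and the `key` column are hypothetical names introduced for illustration, and reading `CH` as China and `IN` as India is an assumption based on the question's tables. Both country columns are mapped to one canonical key, and the merge is done on that key instead of on the raw names.

```python
import pandas as pd

# Data frames from the question.
df1 = pd.DataFrame({
    "country": ["India", "China", "United States", "UK"],
    "column B": ["Cell 2", "Cell 4", "Cell 2", "Cell 4"],
})
df2 = pd.DataFrame({
    "Country": ["USA", "CH", "IN"],
    "clm": ["val1", "val2", "val3"],
})

# Hypothetical reference data: canonical key -> known variations.
# Assumption: "CH" means China and "IN" means India.
variations = {
    "us": ["USA", "US", "United States", "United States of America",
           "The United States of America"],
    "cn": ["China", "CH"],
    "in": ["India", "IN"],
    "uk": ["UK"],
}

# Invert to a flat lookup: variation -> canonical key.
lookup = {name: key for key, names in variations.items() for name in names}

# Add the canonical key to both frames; unknown names become NaN.
df1["key"] = df1["country"].map(lookup)
df2["key"] = df2["Country"].map(lookup)

# Merge on the canonical key rather than on the raw country names.
merged = df1.merge(df2, on="key", how="left")
print(merged)
```

A left merge keeps every row of `df1`, so `UK`, which has no counterpart in `df2`, stays in the result with missing values in the `df2` columns.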
CodePudding user response:
Use .count to count how many times United States appears in the list, then use an if statement to check whether it is listed there. Do the same for each of the other variations, and add a final if statement that checks whether any of them is in the list, so that you can output the value that you want.
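A minimal sketch of that check, assuming the column values are held in a plain Python list. The names `names`, `us_variations`, `counts`, `has_us`, and `total_us` are hypothetical. Note that this only detects which variations are present; it does not perform the merge itself.

```python
# Values from the example column in the first answer.
names = ["USA", "US", "United States", "FR", "France"]

# Variations of the value, as listed in the question.
us_variations = [
    "USA",
    "US",
    "United States",
    "United States of America",
    "The United States of America",
]

# list.count returns how many times a value occurs in the list.
counts = {variation: names.count(variation) for variation in us_variations}

# Final check: is at least one of the variations present?
has_us = any(n > 0 for n in counts.values())
total_us = sum(counts.values())

if has_us:
    print(f"Found {total_us} entries that refer to the United States")
```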