I am trying to merge multiple dataframes to a master dataframe based on the columns in the master dataframes. For Example:
MASTER DF:
PO ID | Sales year | Name | Acc year |
---|---|---|---|
10 | 1934 | xyz | 1834 |
11 | 1942 | abc | 1842 |
SLAVE DF:
PO ID | Yr | Amount | Year |
---|---|---|---|
12 | 1935 | 365.2 | 1839 |
13 | 1966 | 253.9 | 1855 |
RESULTANT DF:
PO ID | Sales Year | Acc Year |
---|---|---|
10 | 1934 | 1834 |
11 | 1942 | 1842 |
12 | 1935 | 1839 |
13 | 1966 | 1855 |
Notice how I have manually mapped columns (Sales Year-->Yr and Acc Year-->Year) since I know they are the same quantity, only the column names are different.
I am trying to write some logic which can map them automatically based on some criteria (be it column names or the data type of that column) so that user does not need to map them manually.
If I map them by column name, both the columns have different names (Sales Year, Yr) and (Acc Year, Year). So to which column should the fourth column (Year) in the SLAVE DF be mapped in the MASTER DF?
Another way would be to map them based on their column values but again they are the same so cannot do that.
The logic should be able to map Yr to Sales Year and map Year to Acc Year automatically. Any idea/logic would be helpful.
Thanks in advance!
CodePudding user response:
I think safest is manually rename columns names.
df = df.rename(columns={'Yr':'Sales year','Sales year':'Sales Year',
'Year':'Acc Year','Acc Year':'Acc year'})
One idea is filter columns names for integers and if all values are between thresholds, here between 1800 and 2000, last set columns names:
df = df.set_index('PO ID')
df1 = df.select_dtypes('integer')
mask = (df1.gt(1800) & df1.lt(2000)).all().reindex(df.columns, fill_value=False)
df = df.loc[:, mask].set_axis(['Sales Year','Acc Year'], axis=1)
CodePudding user response:
Generally this is impossible as there is no solid/consistent factor by which we can map the columns.
That being said what one can do is use cosine similarity to calculate how similar one string (in this case the column name) is to other strings in another dataframe.
So in your case, we'll get 4 vectors for the first dataframe and 4 for the other one. Now calculate the cosine similarity between the first vector(PO ID) from the first dataframe and first vector from second dataframe (PO ID). This will return 100% as both the strings are same.
For each and every column, you'll get 4 confidence scores. Just pick the highest and map them.
That way you can get a makeshift logic through which you can map the column although there are loopholes in this logic too. But it is better than nothing as that way the number of columns to be mapped by the user will be less as compared to mapping them all manually.
Cheers!