I'm trying to build a Python machine learning model that can predict the column names of unseen column data, based on previous datasets. As a simplified example, a training dataframe might look like:
Currency | Security Number |
---|---|
USD | 000402625 |
CAD | 001477825 |
USD | 200398025 |
USD | 000403458 |
JPY | 099402464 |
EUR | 458592625 |
The model would learn to distinguish currencies from security numbers. Feeding this test dataframe to the model:
X | Y |
---|---|
CAD | 500235025 |
CAD | 200394855 |
EUR | 999398025 |
EUR | 234890578 |
USD | 980758345 |
JPY | 123754890 |
would identify column X as Currency and column Y as Security Number.
I've done some research but couldn't find anything that allows predictions based on full column data. Any help would be appreciated.
CodePudding user response:
Since all the possible currencies are known, you can get 100% accuracy by simply checking each value against a known list instead of making a prediction with a model.
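As a rough sketch of that lookup idea, the snippet below flags a column as Currency when most of its values appear in a known set. The names `KNOWN_CURRENCIES`, `currency_fraction`, `is_currency_column` and the 0.9 threshold are made up for illustration, and the code list is deliberately incomplete.

```python
# Hypothetical, deliberately incomplete set of currency codes; extend as needed.
KNOWN_CURRENCIES = {"USD", "CAD", "JPY", "EUR", "GBP", "CHF", "AUD"}


def currency_fraction(values):
    """Return the share of values that are known currency codes."""
    cleaned = [str(v).strip().upper() for v in values]
    if not cleaned:
        return 0.0
    return sum(v in KNOWN_CURRENCIES for v in cleaned) / len(cleaned)


def is_currency_column(values, threshold=0.9):
    """Treat a column as Currency if most of its values are known codes.

    The threshold is an assumption; it tolerates a few dirty cells.
    """
    return currency_fraction(values) >= threshold


# The test dataframe from the question, written as plain lists per column.
columns = {
    "X": ["CAD", "CAD", "EUR", "EUR", "USD", "JPY"],
    "Y": ["500235025", "200394855", "999398025",
          "234890578", "980758345", "123754890"],
}

detected = {name: is_currency_column(vals) for name, vals in columns.items()}
print(detected)
```

With a pandas dataframe, the same function could be fed `df[col].tolist()` for each column.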
But generally speaking, you can put all your data into one big table (for example an Excel sheet) in which each row holds a value and its label. Then shuffle the rows to randomize their order and train a classifier on the whole thing:
Value | Label |
---|---|
USD | Currency |
001477825 | Security Number |
000403458 | Security Number |
EUR | Currency |
If you add enough data, the model should predict that "BLA" is a currency and that 349834989 is a security number. Neither is a real currency or security number, but the predictions should be close enough to what you need :) That is the trade-off you accept when you use machine learning :)
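A minimal sketch of that value/label approach, assuming a hand-rolled nearest-centroid classifier over three character-level features (a real project would more likely use a library classifier). Every function name here is hypothetical. Each cell is classified on its own, and the column name is chosen by majority vote over the cells.

```python
import random
from collections import Counter, defaultdict


def featurize(value):
    """Turn one cell into a few simple character-level features."""
    s = str(value)
    n = len(s) or 1  # avoid division by zero for empty cells
    return [
        min(len(s), 20) / 20,             # length, capped and scaled to 0..1
        sum(c.isdigit() for c in s) / n,  # share of digits
        sum(c.isalpha() for c in s) / n,  # share of letters
    ]


def train(rows):
    """Nearest-centroid training: average the feature vectors per label."""
    grouped = defaultdict(list)
    for value, label in rows:
        grouped[label].append(featurize(value))
    return {
        label: [sum(col) / len(col) for col in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict_value(centroids, value):
    """Return the label whose centroid is closest to this value."""
    feats = featurize(value)

    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(feats, centroids[label]))

    return min(centroids, key=squared_distance)


def predict_column(centroids, values):
    """Classify every cell, then take a majority vote for the column."""
    votes = Counter(predict_value(centroids, v) for v in values)
    return votes.most_common(1)[0][0]


# One long value/label table built from the training dataframe in the question.
rows = [(v, "Currency") for v in ["USD", "CAD", "USD", "USD", "JPY", "EUR"]]
rows += [(v, "Security Number") for v in
         ["000402625", "001477825", "200398025",
          "000403458", "099402464", "458592625"]]

# Shuffling does not affect this particular classifier, but it matters for
# many iterative trainers, so it is kept to mirror the description above.
random.Random(0).shuffle(rows)

centroids = train(rows)
x_label = predict_column(centroids, ["CAD", "CAD", "EUR", "EUR", "USD", "JPY"])
y_label = predict_column(centroids, ["500235025", "200394855", "999398025",
                                     "234890578", "980758345", "123754890"])
print(x_label, "|", y_label)
```

Because the features only describe the shape of a value, `predict_value(centroids, "BLA")` also comes out as Currency, which is the kind of plausible but wrong answer described above.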
BUT
You will run into problems if several columns all contain numbers. In that case, the numbers in each column need to follow a pattern that can be associated with that column, and that might simply not be the case.
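One way to check whether numeric columns carry such a pattern is to summarise each whole column with a few format statistics and compare the results. This is only a sketch: `column_profile` and the `quantities` column are invented for illustration, and the chosen statistics are assumptions, not a complete feature set.

```python
def column_profile(values):
    """Summarise a whole column with a few format statistics."""
    strs = [str(v) for v in values]
    n = len(strs) or 1  # avoid division by zero for empty columns
    return {
        "mean_length": sum(len(s) for s in strs) / n,
        "all_digits": sum(s.isdigit() for s in strs) / n,
        "leading_zero": sum(s.startswith("0") for s in strs) / n,
    }


# Security numbers from the question's training dataframe.
security_numbers = ["000402625", "001477825", "200398025",
                    "000403458", "099402464", "458592625"]

# Hypothetical second numeric column, invented purely for illustration.
quantities = ["12", "7", "150", "3", "48", "9"]

security_profile = column_profile(security_numbers)
quantity_profile = column_profile(quantities)

print(security_profile)
print(quantity_profile)
```

Here both columns consist entirely of digits, so only the length and leading-zero statistics separate them. If two numeric columns shared the same length and format, their profiles would look alike and no model could tell them apart from the values alone.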