Machine learning model to predict column names based on column data?

I'm trying to build a Python machine learning model that can predict the column names of unseen column data, based on previous datasets. As a simplified example, a training dataframe might look like this:

Currency	Security Number
USD	000402625
CAD	001477825
USD	200398025
USD	000403458
JPY	099402464
EUR	458592625

Here the model would learn to distinguish currencies from security numbers, so that feeding it this test dataframe:

X	Y
CAD	500235025
CAD	200394855
EUR	999398025
EUR	234890578
USD	980758345
JPY	123754890

would identify column X as Currency and column Y as Security Number.

I've done some research but couldn't find anything that allows predictions based on full column data. Any help would be appreciated.

CodePudding user response:

Since all possible currency codes are known, you can get 100% accuracy by simply checking the values against a known list instead of making a prediction with a model.
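A minimal sketch of that lookup approach, in plain Python. The code set below is a small, deliberately incomplete sample, and the function names and the 0.9 threshold are assumptions made for illustration, not part of any library:

```python
# Hypothetical, incomplete sample of currency codes; a real list would be
# loaded from an authoritative source instead of being typed in by hand.
KNOWN_CURRENCIES = {"USD", "CAD", "JPY", "EUR", "GBP", "CHF", "AUD"}


def currency_fraction(values):
    """Return the share of values that appear in the known currency list."""
    cleaned = [str(v).strip().upper() for v in values]
    if not cleaned:
        return 0.0
    return sum(v in KNOWN_CURRENCIES for v in cleaned) / len(cleaned)


def is_currency_column(values, threshold=0.9):
    """Treat a column as Currency if most of its values are known codes.

    The threshold is an assumed tolerance for a few dirty cells.
    """
    return currency_fraction(values) >= threshold


# The test dataframe from the question, written as plain lists.
columns = {
    "X": ["CAD", "CAD", "EUR", "EUR", "USD", "JPY"],
    "Y": ["500235025", "200394855", "999398025",
          "234890578", "980758345", "123754890"],
}

labels = {
    name: "Currency" if is_currency_column(values) else "Unknown"
    for name, values in columns.items()
}
print(labels)
```

Anything that is not recognized stays "Unknown" here, so a column type without a fixed vocabulary (such as security numbers) still needs a different check.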

But generally speaking, you can put all your data into one big table (an Excel sheet, for example) in which each row holds a value and its label. Then you shuffle the rows to randomize their order and train a model on the whole table.

Value	Label
USD	Currency
001477825	Security Number
000403458	Security Number
EUR	Currency

If you add enough data, you should be able to predict that "BLA" is a currency and that 349834989 is a security number. Neither is actually correct, but the predictions should be close enough to what you need :) That is what you get when you use machine learning :)
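The value/label idea above, plus a majority vote to get from single values back to a whole column, could be sketched roughly as follows. This is only an illustration: it uses a hand-rolled nearest-centroid classifier in plain Python rather than any particular library, and the three character-level features and all function names are assumptions. Shuffling is skipped because a centroid does not depend on row order.

```python
from collections import Counter, defaultdict


def featurize(value):
    """Turn one cell into simple character-level features (assumed choice)."""
    text = str(value)
    n = max(len(text), 1)
    return (
        sum(c.isdigit() for c in text) / n,  # share of digits
        sum(c.isalpha() for c in text) / n,  # share of letters
        min(len(text), 20) / 20,             # length, capped and scaled
    )


def train(rows):
    """rows: iterable of (value, label). Returns a mean feature vector per label."""
    grouped = defaultdict(list)
    for value, label in rows:
        grouped[label].append(featurize(value))
    return {
        label: tuple(sum(dim) / len(vectors) for dim in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict_value(centroids, value):
    """Pick the label whose centroid is closest to the value's features."""
    feats = featurize(value)
    return min(
        centroids,
        key=lambda label: sum(
            (a - b) ** 2 for a, b in zip(feats, centroids[label])
        ),
    )


def predict_column(centroids, values):
    """Label a whole column by majority vote over its values."""
    votes = Counter(predict_value(centroids, v) for v in values)
    return votes.most_common(1)[0][0]


# The value/label rows from the table above.
training_rows = [
    ("USD", "Currency"),
    ("001477825", "Security Number"),
    ("000403458", "Security Number"),
    ("EUR", "Currency"),
]
centroids = train(training_rows)

# The test dataframe from the question, written as plain lists.
test_columns = {
    "X": ["CAD", "CAD", "EUR", "EUR", "USD", "JPY"],
    "Y": ["500235025", "200394855", "999398025",
          "234890578", "980758345", "123754890"],
}
predicted = {
    name: predict_column(centroids, values)
    for name, values in test_columns.items()
}
print(predicted)
```

With these features any three-letter string lands on Currency, which is exactly the "BLA" behaviour described above: the model learns the shape of the values, not whether they are valid.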

BUT

You will run into problems if several columns all contain numbers. In that case, the numbers in each column need to follow a pattern that the model can associate with that column, and that might simply not be the case.
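One way to check whether such a pattern exists is to profile each numeric column before training anything. The sketch below is an assumption-laden illustration: the `quantities` column is invented purely for comparison, and the chosen statistics (value lengths, share of leading zeros) are just examples of patterns a model could pick up.

```python
from collections import Counter


def column_profile(values):
    """Summarize a column's surface patterns (illustrative statistics only)."""
    texts = [str(v) for v in values]
    return {
        "lengths": Counter(len(t) for t in texts),
        "leading_zero_share": sum(t.startswith("0") for t in texts) / len(texts),
        "all_digits": all(t.isdigit() for t in texts),
    }


# Security numbers from the question's training dataframe.
security_numbers = ["000402625", "001477825", "200398025",
                    "000403458", "099402464", "458592625"]

# Hypothetical second numeric column, invented for this comparison.
quantities = ["100", "2500", "75", "12000", "300", "45"]

security_profile = column_profile(security_numbers)
quantity_profile = column_profile(quantities)

print(security_profile)
print(quantity_profile)
```

Here the two profiles differ clearly (fixed length and leading zeros versus variable length), so a model has something to learn. If two numeric columns produce near-identical profiles, a classifier that only sees the values has no basis for telling them apart.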