Machine learning model to predict column names based on column data?

Time:08-11

I'm trying to build a machine learning model in Python that can predict the column names of unseen column data, based on previous datasets. As a simplified example, a training dataframe could look like this:

Currency Security Number
USD 000402625
CAD 001477825
USD 200398025
USD 000403458
JPY 099402464
EUR 458592625

The model would learn to distinguish currencies from security numbers, so that feeding this test dataframe to the model:

X Y
CAD 500235025
CAD 200394855
EUR 999398025
EUR 234890578
USD 980758345
JPY 123754890

would identify column X as Currency and column Y as Security Number.

I've done some research but couldn't find anything that makes predictions from a full column of data. Any help would be appreciated.

CodePudding user response:

Since all possible currencies are known, you can identify currency values with 100% accuracy by checking them against a known list (provided the list is complete) instead of making a prediction with a model.
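
As a minimal sketch of the known-list check: the set below is a hypothetical, deliberately incomplete sample of currency codes, and the function names and the 0.9 threshold are assumptions made for illustration, not part of the original question.

```python
# Hypothetical, incomplete sample of currency codes; extend it for real use.
KNOWN_CURRENCIES = {"USD", "CAD", "JPY", "EUR", "GBP", "CHF", "AUD"}


def currency_fraction(values):
    """Return the fraction of values that appear in the known currency list."""
    cleaned = [str(v).strip().upper() for v in values]
    if not cleaned:
        return 0.0
    return sum(v in KNOWN_CURRENCIES for v in cleaned) / len(cleaned)


def is_currency_column(values, threshold=0.9):
    """Treat a column as Currency if most of its values are known codes.

    The threshold is an assumed tolerance for a few dirty cells.
    """
    return currency_fraction(values) >= threshold


column_x = ["CAD", "CAD", "EUR", "EUR", "USD", "JPY"]
column_y = ["500235025", "200394855", "999398025",
            "234890578", "980758345", "123754890"]

print(is_currency_column(column_x))  # True
print(is_currency_column(column_y))  # False
```

A lookup like this needs no training data, but it only works for columns whose values come from a closed, known set.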

But generally speaking, you can put all your data into one large table (an Excel sheet, for example) in which each row holds a value and its label. Then you shuffle the rows so their order is random, and train a classifier on the whole thing.

Value Label
USD Currency
001477825 Security Number
000403458 Security Number
EUR Currency

If you add enough data, the model should predict that "BLA" is a currency and that 349834989 is a security number. Neither is a real currency or security number, but the predictions should be close enough to what you need :) This is what happens when you use machine learning :)
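
To make the value/label idea concrete, here is a rough sketch using a tiny hand-written Naive Bayes classifier over character classes. Everything in it (the `features` function, the `TinyNaiveBayes` class, the `predict_column` majority vote) is an assumption chosen for illustration; a library classifier trained on the same value/label rows would play the same role.

```python
import math
import random
from collections import Counter, defaultdict


def features(value):
    """Map a cell to coarse tokens: one per character class, plus its length."""
    text = str(value)
    tokens = ["A" if ch.isalpha() else "9" if ch.isdigit() else "?" for ch in text]
    tokens.append(f"len={len(text)}")
    return tokens


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, values, labels):
        self.label_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for value, label in zip(values, labels):
            self.token_counts[label].update(features(value))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, value):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            counts = self.token_counts[label]
            # The extra +1 reserves probability mass for unseen tokens.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(count / total)
            for token in features(value):
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def predict_column(model, values):
    """Label a whole column by majority vote over its per-value predictions."""
    votes = Counter(model.predict(v) for v in values)
    return votes.most_common(1)[0][0]


# One row per value, labelled with its column name (data from the question).
rows = [(v, "Currency") for v in ["USD", "CAD", "USD", "USD", "JPY", "EUR"]]
rows += [(v, "Security Number") for v in
         ["000402625", "001477825", "200398025",
          "000403458", "099402464", "458592625"]]
random.Random(0).shuffle(rows)  # shuffle so row order carries no information

model = TinyNaiveBayes().fit([v for v, _ in rows], [label for _, label in rows])

print(model.predict("BLA"))        # Currency
print(model.predict("349834989"))  # Security Number
print(predict_column(model, ["CAD", "CAD", "EUR", "EUR", "USD", "JPY"]))  # Currency
```

The majority vote is what turns per-value predictions into a prediction for the whole column, which is what the question asks for.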

BUT

You will run into problems if several columns all contain numbers. In that case, the numbers in each column need to follow a pattern that the model can associate with that column, and that might simply not be the case.
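
One way to check for this problem before training is to compare simple per-column summaries. The sketch below is only an illustration: `column_profile` and the "Account ID" column are hypothetical. If two columns produce the same profile, a model that looks only at these surface patterns cannot tell them apart.

```python
def column_profile(values):
    """Summarise a column by surface statistics of its values."""
    texts = [str(v) for v in values]
    n = max(len(texts), 1)                       # avoid division by zero
    chars = max(sum(len(t) for t in texts), 1)
    return {
        "mean_length": sum(len(t) for t in texts) / n,
        "digit_fraction": sum(ch.isdigit() for t in texts for ch in t) / chars,
        "distinct_ratio": len(set(texts)) / n,
    }


security_numbers = ["000402625", "001477825", "200398025",
                    "000403458", "099402464", "458592625"]
# Hypothetical second numeric column with the same shape (nine digits each).
account_ids = ["500235025", "200394855", "999398025",
               "234890578", "980758345", "123754890"]

print(column_profile(security_numbers))
print(column_profile(account_ids))
# Identical profiles mean these features give a model nothing to separate them.
print(column_profile(security_numbers) == column_profile(account_ids))  # True
```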
