I have features like Amount (20-100K USD), Percent1 (0-1), and Percent2 (0-1). Amount values range between 20 and 100,000 US dollars, and the percent columns hold decimals between 0 and 1. These features are positively skewed, so I applied a log transformation to Amount and Yeo-Johnson (via sklearn's PowerTransformer) to the Percent1 and Percent2 columns.
Is it right to apply different transformations to different columns? Will it affect model performance, or should I apply the same transformation to all columns?
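Here is roughly what I'm doing, as a minimal sketch (the column names are mine, and the ColumnTransformer/FunctionTransformer wrappers are just one reasonable way to wire this up in sklearn):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

# Toy data in the ranges described above (values are made up)
df = pd.DataFrame({
    "Amount": np.random.uniform(20, 100_000, size=100),
    "Percent1": np.random.beta(2, 5, size=100),  # skewed values in 0-1
    "Percent2": np.random.beta(2, 5, size=100),
})

# log1p on Amount, Yeo-Johnson on the percent columns
preprocess = ColumnTransformer([
    ("log_amount", FunctionTransformer(np.log1p), ["Amount"]),
    ("yeo_johnson", PowerTransformer(method="yeo-johnson"), ["Percent1", "Percent2"]),
])

X = preprocess.fit_transform(df)
```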
CodePudding user response:
It's about understanding the benefit of the transformation.
Consider a linear model like f(x1, x2) = w1*x1 + w2*x2.
If x1 is on the order of 100,000 (like Amount) and x2 is on the order of 1.0 (like the percents), then the gradient with respect to w1 is about 100,000 times larger than the gradient with respect to w2. Mathematically, a gradient-descent weight update looks like:
- w1 = w1 - lr * g * x1
- w2 = w2 - lr * g * x2
where lr is the learning rate and g is the gradient of the loss with respect to the model's output.
Because each update is proportional to the feature value, w1 moves roughly 100,000 times faster than w2, which implicitly treats the Amount feature as far more important than the percent features.
That's why features are usually transformed to comparable scales: so that no feature dominates the others simply because of its units.
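A quick numerical sketch of that effect, using one gradient step for a linear model with squared error (the numbers are hypothetical):

```python
import numpy as np

lr = 0.01
w = np.zeros(2)                  # weights for [amount, percent]
x = np.array([100_000.0, 0.5])   # one sample: Amount ~ 100K, Percent ~ 0.5
y = 1.0                          # made-up target

# gradient of squared error (y_hat - y)^2 w.r.t. w is 2*(y_hat - y)*x
y_hat = w @ x
g = 2 * (y_hat - y)
w -= lr * g * x

print(w)  # update to w1 is ~200,000x larger than to w2, purely due to scale
```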
CodePudding user response:
A few things need to be known before this question can be answered.
The answer depends on the model you are using. For some models it's better if the ranges of the different inputs are the same; other models are agnostic to that. And of course, sometimes one might also be interested in assigning different priorities to the inputs.
To get back to your question: depending on the model, there might be absolutely no harm in applying different transformations, or there could be performance differences.
For example, linear regression models would be greatly affected by the feature transformation, whereas supervised neural networks most likely wouldn't be.
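One rough way to check this empirically on your own data, as a sketch (SGDRegressor stands in here for any gradient-trained linear model, and the synthetic data just mimics the scales from the question):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(20, 100_000, 500),  # Amount-like feature
    rng.beta(2, 5, 500),            # Percent-like feature
])
y = 0.00001 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 0.1, 500)

raw = SGDRegressor(max_iter=1000, random_state=0)
transformed = make_pipeline(
    PowerTransformer(),  # Yeo-Johnson + standardization by default
    SGDRegressor(max_iter=1000, random_state=0),
)

print(cross_val_score(raw, X, y, cv=5).mean())          # often poor/unstable on raw scales
print(cross_val_score(transformed, X, y, cv=5).mean())  # typically much better
```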
You might want to check this Cross Validated question: https://stats.stackexchange.com/questions/397227/why-feature-transformation-is-needed-in-machine-learning-statistics-doesnt-i