In sklearn, the documentation of QuantileTransformer says:
"This method transforms the features to follow a uniform or a normal distribution."
The documentation of PowerTransformer says:
"Apply a power transform featurewise to make data more Gaussian-like."
It seems both of them can transform features to a Gaussian/normal distribution. What are the differences in this respect, and when should I use which?
CodePudding user response:
The terminology is confusing because "Gaussian" and "normal distribution" are two names for the same thing.
QuantileTransformer and PowerTransformer are both non-linear.
As for the exact difference, according to scikit-learn:
"QuantileTransformer provides non-linear transformations in which distances between marginal outliers and inliers are shrunk. PowerTransformer provides non-linear transformations in which data is mapped to a normal distribution to stabilize variance and minimize skewness."
Source and more info here: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
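To see that difference concretely, here is a minimal sketch that runs both transformers on the same skewed feature; the lognormal sample and all parameter choices are illustrative only, not a recommendation:

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))  # heavily right-skewed feature

# Non-parametric: maps empirical quantiles onto the normal CDF
qt = QuantileTransformer(output_distribution="normal", random_state=0)
X_qt = qt.fit_transform(X)

# Parametric: fits one Yeo-Johnson lambda per feature (the default method)
pt = PowerTransformer()
X_pt = pt.fit_transform(X)

# Skewness after transforming: the quantile output is typically
# closer to 0 (i.e. closer to Gaussian) than the power output.
print("quantile:", stats.skew(X_qt).round(3))
print("power:   ", stats.skew(X_pt).round(3))
```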
CodePudding user response:
The main difference is that PowerTransformer() is parametric while QuantileTransformer() is non-parametric. Box-Cox or Yeo-Johnson will make your data look more "normal" (i.e. less skewed and more centered), but it's often still far from a perfect Gaussian. QuantileTransformer(output_distribution='normal') usually produces results that look much closer to Gaussian, at the cost of distorting linear relationships somewhat more. I believe there's no rule of thumb to decide which one will work better in a given case, but it's worth noting that you can select the optimal scaler in a pipeline when doing e.g. GridSearchCV(), as sketched below.
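For example, a minimal sketch of that grid-search idea; the Ridge model, the synthetic data, and the grid values are illustrative assumptions, not part of any recipe:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

pipe = Pipeline([
    ("scale", "passthrough"),  # placeholder step, filled in by the grid
    ("model", Ridge()),
])

# Treat the scaler itself as a hyperparameter to search over
param_grid = {
    "scale": [
        PowerTransformer(),  # Yeo-Johnson by default
        QuantileTransformer(output_distribution="normal",
                            n_quantiles=100, random_state=0),
    ],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_["scale"])
```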