Consider a housing price dataset, where the goal is to predict the sale price.
I would like to do this by predicting the sale price per square meter instead, since it yields better results.
My question: if I implement it like this, does it introduce an information leak into the test set?
First I split my dataset with scikit-learn:
df= read(Data)
target = df["SalePrice"]
df.drop(columns=["SalePrice"], inplace=True)
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.20)
And then scale y_train:
# Scale target by LivingSpace and call fit()
y_train = target/X_train["LivingSpace"]
estimator.fit(X_train, y_train)
Then I call predict and scale y_test the same way to get the sale price per square meter:
y_pred, y_true = estimator.predict(X_test), y_test/X_test["LivingSpace"]
I think this is valid, since I only scale the target by a known value. It should not make a difference if I predict SalePrice directly or SalePrice / LivingSpace, since LivingSpace is given to me anyway when I predict the price.
If this holds true, we could also directly apply this target transformation to the train and test set and just transform the predicted values back in the end, right?
This should of course also hold true for any feature given in X. As long as information about the target itself is NOT present in X, I see no problem here. Remember that the true target is SalePrice only, so my intention is to scale the prediction back from sale price per square meter. The transformation is just used to get better training results.
What are your thoughts about this code?
CodePudding user response:
All is good.
1· You are not leaking information.
2· You can directly apply this target transformation to train and test sets and transform them back after prediction.
3· You can do this for any feature given in X. You don't need the caveat about target information: you can always transform y using X in any way you want, as long as you can invert the transformation to recover SalePrice. The only thing you are "leaking" is your own understanding of the problem at hand, which is absolutely fine.
4· Your code has a bug: target still holds every row, not just the training rows. Instead of
y_train = target/X_train["LivingSpace"]
you should have
y_train = y_train/X_train["LivingSpace"]
y_test = y_test/X_test["LivingSpace"]
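To make the round trip concrete, here is a minimal sketch of the whole workflow: split, divide the training target by LivingSpace, fit, predict per square meter, and multiply back to SalePrice. The synthetic data, the Rooms column, the choice of LinearRegression, and the helper names y_train_sqm, pred_sqm, and pred_price are assumptions for illustration only, not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (an assumption, not the asker's dataset).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "LivingSpace": rng.uniform(40.0, 200.0, n),
    "Rooms": rng.integers(1, 7, n),  # hypothetical extra feature
})
price_per_sqm = 2000.0 + 150.0 * df["Rooms"] + rng.normal(0.0, 50.0, n)
target = price_per_sqm * df["LivingSpace"]  # plays the role of SalePrice

X_train, X_test, y_train, y_test = train_test_split(
    df, target, test_size=0.20, random_state=0
)

# Divide the training target by the training rows' own LivingSpace.
y_train_sqm = y_train / X_train["LivingSpace"]

estimator = LinearRegression()
estimator.fit(X_train, y_train_sqm)

# Predict price per square meter, then multiply back to get SalePrice.
pred_sqm = estimator.predict(X_test)
pred_price = pred_sqm * X_test["LivingSpace"]

# Evaluate on the original SalePrice scale (mean absolute error).
mae = (pred_price - y_test).abs().mean()
```

Here y_test is left on the original scale and the prediction is transformed back before scoring. If you instead rescale y_test as in point 4, do not divide it by LivingSpace a second time when computing y_true.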