Fit - Transform

Time:05-13

We wrote the code below in a machine learning course I took.

My question is: at the bottom of the code, why do we fit the scaler on X_train but only transform X_test, without fitting on it?

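import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

hit = pd.read_csv("./xxx/xxx.csv")
df = hit.copy()
df = df.dropna()
y = df["Salary"]
# Numeric features only; the three categorical columns are encoded separately
X_ = df.drop(["Salary","League","Division","NewLeague"],axis=1).astype("float64")
dms = pd.get_dummies(df[["League","Division","NewLeague"]])
# Keep one dummy column per categorical variable
X = pd.concat([X_ , dms[["League_N","Division_W","NewLeague_N"]]],axis=1)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=42)

scaler = StandardScaler()
# Learn each column's mean and standard deviation from the training set only
scaler.fit(X_train)
# Apply those same training statistics to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)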

CodePudding user response:

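Because X_train is the reference for your training. fit computes the scaling parameters (for StandardScaler, each column's mean and standard deviation), and transform applies them. If you also fit on the test data, information about the test set leaks into the preprocessing, and the test set ends up scaled with different parameters than the ones the model was trained on.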
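To make the fit/transform split concrete, here is a minimal sketch. It uses synthetic data generated with NumPy rather than the CSV from the question (the array shapes and distribution parameters are assumptions chosen for illustration), and it checks that transforming the test set is just "subtract the training mean, divide by the training standard deviation".

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train and X_test (illustrative assumption, not the question's data)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
scaler.fit(X_train)  # learns per-column mean and standard deviation from training rows only

train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)  # population std (ddof=0), which StandardScaler uses

# transform() reuses the statistics stored by fit(); it does not look at test statistics
X_test_scaled = scaler.transform(X_test)
manual_test_scaled = (X_test - train_mean) / train_std

print(np.allclose(scaler.mean_, train_mean))
print(np.allclose(scaler.scale_, train_std))
print(np.allclose(X_test_scaled, manual_test_scaled))
```

All three checks should agree, which shows that the test set never contributes to the stored mean_ and scale_.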

I like to think that I should never use the test data in any way except at the end of model training, for evaluation. So the test data shouldn't be involved in any fitting, whether of a scaler or of a model.
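One way to enforce that rule is to bundle the scaler and the model in a scikit-learn pipeline, so that fit only ever sees training data. The sketch below is an illustration under assumptions: the question does not name a model, so Ridge is a stand-in, and the data comes from make_regression instead of the original CSV.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Ridge is a hypothetical model choice; any estimator could take its place
model = make_pipeline(StandardScaler(), Ridge())

# fit() fits the scaler on X_train, scales X_train, then fits Ridge on the result
model.fit(X_train, y_train)

# score() only transforms X_test with the already-fitted scaler, then evaluates R^2
r2 = model.score(X_test, y_test)
print(r2)
```

With this layout there is no separate scaler.fit call that could accidentally be given the test set.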

But don't worry: X_train should have roughly the same distribution as X_test, so scaling the test set with the training statistics will work.
