# splitting the dataset into train, val, and test sets (60-20-20)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

features = df.drop('quality', axis=1)
labels = df['quality']
# first hold out 40% of the data, then split that hold-out in half for val and test
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
X_train.to_csv('train_features.csv', index=False)
X_val.to_csv('val_features.csv', index=False)
X_test.to_csv('test_features.csv', index=False)
y_train.to_csv('train_labels.csv', index=False)
y_val.to_csv('val_labels.csv', index=False)
y_test.to_csv('test_labels.csv', index=False)
from sklearn.ensemble import RandomForestRegressor
rfModel = RandomForestRegressor(n_estimators=500)
rfModel.fit(X_train, np.log1p(y_train))  # this is the line that raises the error (np.log1p on y_train)
preds2 = rfModel.predict(X_test)
The output of y_train.info() is:
count 815
unique 2
top 0
freq 704
Name: quality, dtype: int64
The output of y_train.head() is:
635 0
908 0
1578 0
245 0
1451 1
Name: quality, dtype: category
Categories (2, int64): [0 < 1]
I am clueless about why this errors; I could use np.log1p(1e-99) elsewhere, but not with this one. The error info is as below:
CodePudding user response:
The issue is that you attempt to apply the logarithm to an array-like object whose dtype is category rather than int (or np.int64, etc.).
In particular, the error is produced when a Pandas column encodes the class labels as a category instead of a plain integer. This is the case here: the class label of your endogenous variable y_train is encoded as a category, as revealed by your command y_train.head(). Here is a minimal example of how it pops up.
import numpy as np
import pandas as pd
s = pd.Series([1, 2, 3, 4], dtype="category")
np.log1p(s)  # TypeError: Object with dtype category cannot perform the numpy op log1p
However, converting the type to float (or int, etc.) solves the problem, e.g.
s = pd.Series([1, 2, 3, 4], dtype="category")
s = s.astype(float)  # cast the categorical values to a numeric dtype
np.log1p(s)          # no error/warning
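Applied to your own pipeline, a minimal sketch of the same fix could look like the following (assuming y_train and y_test are the categorical Series shown in your y_train.head() output, with integer categories 0 and 1; the names y_train_num and y_test_num are just illustrative):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# cast the categorical targets to plain integers before applying np.log1p
y_train_num = y_train.astype(int)
y_test_num = y_test.astype(int)

rfModel = RandomForestRegressor(n_estimators=500)
rfModel.fit(X_train, np.log1p(y_train_num))

# the predictions are on the log1p scale, so invert with np.expm1 if you
# want them back on the original scale of the quality label
preds2 = np.expm1(rfModel.predict(X_test))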