OneHotEncoder failing after combining dataframes

I have a model that runs successfully.

When I tried to predict with it, it failed because, after one-hot encoding, the test set had more columns than the training set.

After some reading, I found that I need to concat the two df's first, OneHotEncode them, and then split them apart again.

Added a 'temp' column to the train data set with value 'train'.
Added a 'temp' column to the test data set with value 'test'.
This is so that I can split the df apart later using boolean indexing like this:

X = temp_df[temp_df['temp'] == 'train']
X2 = temp_df[temp_df['temp'] == 'test']

Vertically concatenated the two df's.
Verified the shape of the new combined df.
Changed all columns to type 'category' except 'temp', which is object:

basin                    category
region                   category
lga                      category
extraction_type_class    category
management               category
quality_group            category
quantity                 category
source                   category
waterpoint_type          category
cluster                  category
temp                       object
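
For reference, the preparation steps described above can be sketched roughly as follows. This is only a minimal illustration: the two small frames and their values are made up, and only the 'temp' column and the names temp_df, X and X2 come from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the real train/test sets.
train = pd.DataFrame({"basin": ["a", "b", "c"], "quantity": ["dry", "enough", "dry"]})
test = pd.DataFrame({"basin": ["a", "d"], "quantity": ["seasonal", "enough"]})

# Mark each row with its origin so the frames can be separated again later.
train["temp"] = "train"
test["temp"] = "test"

# Stack the two frames vertically and check the combined shape.
temp_df = pd.concat([train, test], ignore_index=True)
print(temp_df.shape)  # (5, 3)

# Convert every column except 'temp' to the 'category' dtype.
cat_cols = [c for c in temp_df.columns if c != "temp"]
temp_df = temp_df.astype({c: "category" for c in cat_cols})
print(temp_df.dtypes)

# Split the combined frame apart again with boolean indexing.
X = temp_df[temp_df["temp"] == "train"]
X2 = temp_df[temp_df["temp"] == "test"]
```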

Now I am simply trying to OneHotEncode like I did before. I choose only categorical columns:

cat_ix = temp_df.select_dtypes(include=['category']).columns

And I try to apply with:

ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
temp_df = ct.fit_transform(temp_df)

It fails on the temp_df = ct.fit_transform(temp_df) line.

These identical steps worked perfectly before I added the temp column and concat'd the two df's.

The exact error:

Traceback (most recent call last):
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 778, in _hstack
    converted_Xs = [
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 779, in <listcomp>
    check_array(X, accept_sparse=True, force_all_finite=False)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\utils\validation.py", line 738, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
    return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'train'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 783, in _hstack
    raise ValueError(
ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.

Why is it complaining about 'train'? That is in the 'temp' column which is being excluded.

Answer:

Note that the traceback doesn't reference OneHotEncoder at all; every frame is in the ColumnTransformer. With remainder='passthrough', the temp column is not excluded: it is passed through untransformed and then tacked onto the one-hot-encoded sparse matrix in the method _hstack. The second error message is the more relevant one. A string-type column cannot be stacked onto a numeric sparse matrix, and the failed attempt to convert it to float is what produces the first error message.
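
A small reproduction may make this easier to see. The frame below is made up for illustration (only the column names 'lga' and 'temp' are borrowed from the question), and it assumes the default sparse_threshold of 0.3: the single categorical column has ten distinct levels so that the combined output is sparse enough for ColumnTransformer to choose a sparse result, which is the path that fails.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: one categorical column with ten levels, plus the
# string marker column stored as object dtype, as in the question.
temp_df = pd.DataFrame({
    "lga": pd.Series(list("abcdefghij"), dtype="category"),
    "temp": pd.Series(["train"] * 7 + ["test"] * 3, dtype=object),
})

cat_ix = temp_df.select_dtypes(include=["category"]).columns
ct = ColumnTransformer([("o", OneHotEncoder(), cat_ix)], remainder="passthrough")

# The encoder itself is fine; the failure happens when the passed-through
# string column is stacked onto the sparse one-hot output.
error_message = None
try:
    ct.fit_transform(temp_df)
except ValueError as err:
    error_message = str(err)

print(error_message)
```

Dropping remainder='passthrough' (the default is remainder='drop') makes the same call succeed, which confirms that the encoder is not the part that fails.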

If the sparse matrix isn't too large, you can just force the output to be dense by using sparse_threshold=0 in the ColumnTransformer or sparse=False in the OneHotEncoder (in scikit-learn 1.2 and later that parameter is named sparse_output). If it is too large for memory (or you'd prefer the sparse matrices), you could use a 0/1 indicator for the train/test split instead of the strings "train", "test".
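
Both options can be sketched as below. The toy frames and the helper combine() are hypothetical and exist only for this illustration; the rest follows the question's setup. The sketch relies on ColumnTransformer placing the remainder column after the encoded columns, so the marker ends up in the last column of the result.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the test set has categories the train set lacks.
train = pd.DataFrame({"basin": ["a", "b", "c"], "source": ["x", "y", "x"]})
test = pd.DataFrame({"basin": ["a", "d"], "source": ["z", "y"]})
cat_cols = ["basin", "source"]


def combine(train_marker, test_marker):
    """Hypothetical helper: concatenate both frames with a 'temp' marker column."""
    df = pd.concat(
        [train.assign(temp=train_marker), test.assign(temp=test_marker)],
        ignore_index=True,
    )
    return df.astype({c: "category" for c in cat_cols})


# Option 1: keep the string markers and force a dense result.
ct_dense = ColumnTransformer(
    [("o", OneHotEncoder(), cat_cols)],
    remainder="passthrough",
    sparse_threshold=0,
)
dense = ct_dense.fit_transform(combine("train", "test"))
# The last column holds the markers; the rest are the one-hot columns.
X_dense = dense[dense[:, -1] == "train", :-1].astype(float)
X2_dense = dense[dense[:, -1] == "test", :-1].astype(float)

# Option 2: use a numeric 1/0 marker so a sparse result can be stacked.
ct_sparse = ColumnTransformer(
    [("o", OneHotEncoder(), cat_cols)],
    remainder="passthrough",
)
# On data this small the transformer may return a dense array, so the result
# is normalised to CSR here to make the indexing below work either way.
encoded = sparse.csr_matrix(ct_sparse.fit_transform(combine(1, 0)))
is_train = encoded[:, -1].toarray().ravel() == 1
X_sparse = encoded[is_train][:, :-1]
X2_sparse = encoded[~is_train][:, :-1]

print(X_dense.shape, X2_dense.shape)
print(X_sparse.shape, X2_sparse.shape)
```

With the dense option the result is an object-dtype array because of the strings, so the one-hot columns need to be cast back to float after the split; the 0/1 indicator avoids that cast.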