OneHotEncoder failing after combining dataframes


I have a model that runs successfully.

When I tried to predict with it, prediction failed because, after one-hot encoding, the test set had more columns than the train set.

After some reading I found that I need to concat the two df's first, one-hot encode, then split them apart. My steps:

  1. Added a 'temp' column to the train data set with value 'train'.
  2. Added a 'temp' column to the test data set with value 'test'.
  3. This is so that I can split the df apart later using boolean indexing like this:
X = temp_df[temp_df['temp'] == 'train']
X2 = temp_df[temp_df['temp'] == 'test']
  4. Vertically concat the two df's.
  5. Verify the shape of the new combined df.
  6. Change all columns to type 'category' except 'temp', which stays object:
basin                    category
region                   category
lga                      category
extraction_type_class    category
management               category
quality_group            category
quantity                 category
source                   category
waterpoint_type          category
cluster                  category
temp                       object
  7. Now I try to OneHotEncode just as I did before, selecting only the categorical columns:
cat_ix = temp_df.select_dtypes(include=['category']).columns
  8. And I apply it with:
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
temp_df = ct.fit_transform(temp_df)

It fails on the temp_df = ct.fit_transform(temp_df) line.

These identical steps worked perfectly before I added the temp column and concat'd the two df's.
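The steps above can be reproduced on a miniature example (the data here is made up for illustration; only `basin` and `region` stand in for the full set of categorical columns):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature stand-ins for the real train/test frames.
train = pd.DataFrame({'basin': list('abcde'), 'region': list('fghij')})
test = pd.DataFrame({'basin': list('abcde'), 'region': list('fghij')})

# Steps 1-2: tag each frame so the rows can be separated again later.
train['temp'] = 'train'
test['temp'] = 'test'

# Steps 4-6: concatenate vertically, make every column except 'temp' categorical.
temp_df = pd.concat([train, test], ignore_index=True)
for col in temp_df.columns.drop('temp'):
    temp_df[col] = temp_df[col].astype('category')

# Steps 7-8: one-hot encode the categorical columns, passing 'temp' through.
cat_ix = temp_df.select_dtypes(include=['category']).columns
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')

try:
    ct.fit_transform(temp_df)
except ValueError as e:
    print(e)  # the ValueError from the traceback below
```

With enough categories the encoded output is sparse by default, and stacking the string-typed 'temp' column onto it is what triggers the error.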

The exact error:

Traceback (most recent call last):
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 778, in _hstack
    converted_Xs = [
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 779, in <listcomp>
    check_array(X, accept_sparse=True, force_all_finite=False)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\utils\validation.py", line 738, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
    return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'train'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 783, in _hstack
    raise ValueError(
ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.

Why is it complaining about 'train'? That is in the 'temp' column which is being excluded.

CodePudding user response:

Note that the traceback doesn't reference OneHotEncoder; it's all inside ColumnTransformer. You're trying to pass through the temp column, which gets tacked onto the one-hot-encoded sparse matrix in the method _hstack, and the second error message is the more relevant one: a string-typed array cannot be stacked onto a numeric sparse matrix, and the attempted numeric conversion is what produces the first error message.

If the sparse matrix isn't too large, you can just force a dense result by using sparse_threshold=0 in the ColumnTransformer or sparse=False in the OneHotEncoder (renamed sparse_output=False in scikit-learn 1.2+). If it is too large for memory (or you'd prefer the sparse matrices), you could use a numeric 0/1 indicator for the train/test split instead of the strings "train", "test".
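Both fixes can be sketched on miniature frames (the data and names below are made up for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature stand-ins for the real train/test frames.
train = pd.DataFrame({'basin': list('abcde'), 'region': list('fghij')})
test = pd.DataFrame({'basin': list('abcde'), 'region': list('fghij')})
train['temp'] = 'train'
test['temp'] = 'test'
temp_df = pd.concat([train, test], ignore_index=True)
for col in temp_df.columns.drop('temp'):
    temp_df[col] = temp_df[col].astype('category')
cat_ix = temp_df.select_dtypes(include=['category']).columns

# Fix 1: force a dense result so the string 'temp' column can be stacked.
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)],
                       remainder='passthrough', sparse_threshold=0)
dense = ct.fit_transform(temp_df)  # dense array: one-hot columns + 'temp' strings

# Fix 2: keep the sparse output, but make the indicator numeric.
numeric_df = temp_df.copy()
numeric_df['temp'] = (numeric_df['temp'] == 'train').astype(int)  # 1=train, 0=test
ct2 = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
result = ct2.fit_transform(numeric_df)  # sparse matrix now stacks fine
```

With fix 2, you can later recover the training rows as those where the last (passthrough) column equals 1.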
