Machine Learning: getting a DataFrame after a OneHotEncoder

I have been stuck on how to convert the result of a OneHotEncoder back to a DataFrame. The idea is that I separate the numeric columns from the categorical columns, starting from this split:

feats = df.drop(["Transported"], axis=1)  
target = df["Transported"]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(feats, target, test_size=0.2, random_state=42)

After doing the split, I needed to separate the numeric columns from the categorical ones for the training and test sets, and I did this:

num_train = X_train.select_dtypes(include=['float64', 'int64'])
cat_train = X_train.select_dtypes(include=['object'])
num_test = X_test.select_dtypes(include=['float64', 'int64'])
cat_test = X_test.select_dtypes(include=['object'])

After this I applied the SimpleImputer and it worked.

import numpy as np
from sklearn.impute import SimpleImputer
imputer_median = SimpleImputer(missing_values=np.nan, strategy='median')
imputer_most_frequent = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

num = ["Age", "RoomService", "FoodCourt", "ShoppingMall","Spa","VRDeck"]
num_train.loc[:,num] = imputer_median.fit_transform(num_train[num])
num_test.loc[:,num] = imputer_median.transform(num_test[num])

cat = ["HomePlanet", "CryoSleep", "Destination","VIP"]
cat_train.loc[:,cat] = imputer_most_frequent.fit_transform(cat_train[cat])
cat_test.loc[:,cat] = imputer_most_frequent.transform(cat_test[cat])

and this is the head of cat_train:

cat_train.head()
     HomePlanet CryoSleep  Destination    VIP
2333      Earth     False  TRAPPIST-1e  False
2589      Earth     False  TRAPPIST-1e  False
8302     Europa      True  55 Cancri e  False
8177       Mars     False  TRAPPIST-1e  False
500      Europa      True  55 Cancri e  False

But after this I needed to apply the OneHotEncoder, like this:

from sklearn.preprocessing import OneHotEncoder
oneh = OneHotEncoder( drop='first',sparse=False)

cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])

And I got this error:

shape mismatch: value array of shape (6954,6) could not be broadcast to indexing result of shape (6954,4)

I tried several ways, but every time I failed to get a DataFrame back after the OneHotEncoder. Please help me out; I am stuck on this and cannot continue with the rest of the work. Thanks in advance.

Here is the full traceback:

ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_16200\2252764984.py in <module>
      3 oneh = OneHotEncoder( drop='first',sparse=False)
      4
----> 5 cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
      6 cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __setitem__(self, key, value)
    714
    715         iloc = self if self.name == "iloc" else self.obj.iloc
--> 716         iloc._setitem_with_indexer(indexer, value, self.name)
    717
    718     def _validate_key(self, key, axis: int):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _setitem_with_indexer(self, indexer, value, name)
   1691             self._setitem_with_indexer_split_path(indexer, value, name)
   1692         else:
-> 1693             self._setitem_single_block(indexer, value, name)
   1694
   1695     def _setitem_with_indexer_split_path(self, indexer, value, name: str):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _setitem_single_block(self, indexer, value, name)
   1941
   1942         # actually do the set
-> 1943         self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
   1944         self.obj._maybe_update_cacher(clear=True, inplace=True)
   1945

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in setitem(self, indexer, value)
    335         For SingleBlockManager, this backs s[indexer] = value
    336         """
--> 337         return self.apply("setitem", indexer=indexer, value=value)
    338
    339     def putmask(self, mask, new, align: bool = True):

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    302                     applied = b.apply(f, **kwargs)
    303                 else:
--> 304                     applied = getattr(b, f)(**kwargs)
    305             except (TypeError, NotImplementedError):
    306                 if not ignore_failures:

~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in setitem(self, indexer, value)
    957         else:
    958             value = setitem_datetimelike_compat(values, len(values[indexer]), value)
--> 959             values[indexer] = value
    960
    961         return self

ValueError: shape mismatch: value array of shape (6954,6) could not be broadcast to indexing result of shape (6954,4)

Next, I tried this:

from sklearn.preprocessing import OneHotEncoder
oneh = OneHotEncoder(handle_unknown='ignore')

cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
cat_test.loc[:,cat] = oneh.transform(cat_test)

and I got this dataframe, but this is not what I am looking for:

      HomePlanet   CryoSleep   Destination   VIP
2333  (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...   (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...   (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...   (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...
2589  (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...   (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...   (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...   (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...

I also used ColumnTransformer, but it does not give me back the DataFrame I want (I mean a DataFrame with the original columns used before the OneHotEncoder; see cat_train above). These are the steps I did:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    transformers=[("OneHotEncoder", OneHotEncoder(drop='first', sparse=False), cat)],
    remainder='passthrough'
)

cat_train = ct.fit_transform(cat_train)
cat_test = ct.transform(cat_test)

cat_train = pd.DataFrame(cat_train, columns=ct.get_feature_names())
cat_test = pd.DataFrame(cat_test, columns=ct.get_feature_names())

cat_train

and the cat_train.head() I got is:

   OneHotEncoder__x0_Europa  OneHotEncoder__x0_Mars  OneHotEncoder__x1_True  OneHotEncoder__x2_PSO J318.5-22  OneHotEncoder__x2_TRAPPIST-1e  OneHotEncoder__x3_True
0                       0.0                     0.0                     0.0                              0.0                            1.0                     0.0
1                       0.0                     0.0                     0.0                              0.0                            1.0                     0.0
2                       1.0                     0.0                     1.0                              0.0                            0.0

This is weird, because next I need to concatenate cat_train with num_train (and the same for the test set), and when I do this a lot of NaN values appear, whereas I had already imputed all the NaN values before. Any idea?

CodePudding user response：

The first error occurs because you try to assign the one-hot-encoded data, which has more columns than the original, back to the same original columns. You need to instead add these dummy columns and delete the original ones. Anyway, applying fit_transform to both train and test (assuming the repeated train row is a typo) is a bad idea: the encoder should be fitted on the training set only and then applied to the test set with transform.

The second error appears to be due to the one-hot-encoded data being sparse. You can specify sparse=False in the OneHotEncoder to fix that, but then probably you'll have the same issue as above.

The best thing to do is to use a ColumnTransformer; it would handle all the concatenation for you.

CodePudding user response：

I managed to find the solution. In fact, I was trying to get back the original columns as they were before the OneHotEncoder (since I had 4 columns, I thought I should get those 4 columns back), which is generally not possible. In my case each cat_train column has more than one category, so the result of a OneHotEncoder must have more columns than before. Based on this, I rewrote the code as follows:

feats = df.drop(["Transported"], axis=1)  
target = df["Transported"]

X_train, X_test, y_train, y_test = train_test_split(feats, target, test_size=0.2, random_state=42)

Separate numeric columns from categorical columns

import numpy as np
num_train = X_train.select_dtypes(include=[np.number])
cat_train = X_train.select_dtypes(exclude=[np.number])
num_test = X_test.select_dtypes(include=[np.number])
cat_test = X_test.select_dtypes(exclude=[np.number])

Fill in missing values

num_imp = SimpleImputer(strategy='median')
num_train = num_imp.fit_transform(num_train)
num_test = num_imp.transform(num_test)
cat_imp = SimpleImputer(strategy='most_frequent')
cat_train = cat_imp.fit_transform(cat_train)
cat_test = cat_imp.transform(cat_test)

Encode categorical variables

cat_enc = OneHotEncoder(handle_unknown='ignore')
cat_train = cat_enc.fit_transform(cat_train)
cat_test = cat_enc.transform(cat_test)

And now the magic part: reconstitute the training and test sets

X_train = pd.concat([pd.DataFrame(num_train), pd.DataFrame(cat_train.toarray())], axis=1)

X_test = pd.concat([pd.DataFrame(num_test), pd.DataFrame(cat_test.toarray())], axis=1)

The DataFrame is now as it should be:

X_train.head()

      0    1       2      3      4    5    0    1    2    3    4    5    6    7    8    9
0  28.0  0.0    55.0    0.0  656.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  1.0  0.0
1  17.0  0.0  1195.0   31.0    0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  1.0  0.0
2  28.0  0.0     0.0    0.0    0.0  0.0  0.0  1.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0
3  20.0  0.0     2.0  289.0  976.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  1.0  1.0  0.0
4  36.0  0.0     0.0    0.0    0.0  0.0  0.0  1.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0
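One possible explanation for the NaN values mentioned in the question, offered as an assumption since the concatenation code was not shown: pd.concat with axis=1 aligns rows on index labels. A DataFrame built from a plain array gets a fresh 0..n-1 index, while num_train kept the shuffled labels from the split, so the two would not line up. The version above avoids this because both parts are rebuilt from arrays, at the cost of losing the column names. The sketch below uses invented values and hypothetical column names to show the difference, and how passing the original index keeps both the rows aligned and the labels readable.

```python
import numpy as np
import pandas as pd

# Invented numeric part that still carries the shuffled labels from the split.
num_train = pd.DataFrame({"Age": [28.0, 17.0, 20.0]}, index=[2333, 2589, 8302])

# Stand-in for a dense encoder output; the column names are hypothetical.
encoded = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
names = ["HomePlanet_Earth", "HomePlanet_Europa"]

# Fresh 0..n-1 index on the encoded part: no labels match, so rows are padded with NaN.
misaligned = pd.concat([num_train, pd.DataFrame(encoded, columns=names)], axis=1)
print(misaligned)

# Reusing the original index keeps the rows aligned, with no NaN introduced.
aligned = pd.concat(
    [num_train, pd.DataFrame(encoded, columns=names, index=num_train.index)],
    axis=1,
)
print(aligned)
```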