I am stuck on how to convert the result of a OneHotEncoder back to a DataFrame. The idea is that I separate the numeric columns from the categorical columns, starting from this split:
feats = df.drop(["Transported"], axis=1)
target = df["Transported"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(feats, target, test_size = 0.2,
random_state=42)
After the split, I needed to separate the numeric columns from the categorical ones for the training and test sets, so I did this:
num_train = X_train.select_dtypes(include=['float64', 'int64'])
cat_train = X_train.select_dtypes(include=['object'])
num_test = X_test.select_dtypes(include=['float64', 'int64'])
cat_test = X_test.select_dtypes(include=['object'])
After this I applied SimpleImputer, and it worked:
import numpy as np
from sklearn.impute import SimpleImputer
imputer_median = SimpleImputer(missing_values=np.nan, strategy='median')
imputer_most_frequent = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
num = ["Age", "RoomService", "FoodCourt", "ShoppingMall","Spa","VRDeck"]
num_train.loc[:,num] = imputer_median.fit_transform(num_train[num])
num_test.loc[:,num] = imputer_median.transform(num_test[num])
cat = ["HomePlanet", "CryoSleep", "Destination","VIP"]
cat_train.loc[:,cat] = imputer_most_frequent.fit_transform(cat_train[cat])
cat_test.loc[:,cat] = imputer_most_frequent.transform(cat_test[cat])
This is the head of cat_train:
cat_train.head()
HomePlanet CryoSleep Destination VIP
2333 Earth False TRAPPIST-1e False
2589 Earth False TRAPPIST-1e False
8302 Europa True 55 Cancri e False
8177 Mars False TRAPPIST-1e False
500 Europa True 55 Cancri e False
But, after this I needed to apply the OneHotEncoder just like this:
from sklearn.preprocessing import OneHotEncoder
oneh = OneHotEncoder( drop='first',sparse=False)
cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
And I got this error:
shape mismatch: value array of shape (6954,6) could not be broadcast to indexing result
of shape (6954,4)
I tried several ways, but every time I failed to get a DataFrame back after the OneHotEncoder. Please help me out; I am stuck on this and cannot continue with the rest of the work. Thanks in advance.
Here is the full traceback:
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_16200\2252764984.py in <module>
3 oneh = OneHotEncoder( drop='first',sparse=False)
4
----> 5 cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
6 cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
~\anaconda3\lib\site-packages\pandas\core\indexing.py in
__setitem__(self, key, value)
714
715 iloc = self if self.name == "iloc" else self.obj.iloc
--> 716 iloc._setitem_with_indexer(indexer, value,
self.name)
717
718 def _validate_key(self, key, axis: int):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in
_setitem_with_indexer(self, indexer, value, name)
1691 self._setitem_with_indexer_split_path(indexer, value, name)
1692 else:
-> 1693 self._setitem_single_block(indexer, value, name)
1694
1695 def _setitem_with_indexer_split_path(self, indexer, value, name: str):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in
_setitem_single_block(self, indexer, value, name)
1941
1942 # actually do the set
-> 1943 self.obj._mgr =
self.obj._mgr.setitem(indexer=indexer, value=value)
1944 self.obj._maybe_update_cacher(clear=True,
inplace=True)
1945
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in
setitem(self, indexer, value)
335 For SingleBlockManager, this backs s[indexer] = value
336 """
--> 337 return self.apply("setitem", indexer=indexer,
value=value)
338
339 def putmask(self, mask, new, align: bool = True):
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in
apply(self, f, align_keys, ignore_failures, **kwargs)
302 applied = b.apply(f, **kwargs)
303 else:
--> 304 applied = getattr(b, f)(**kwargs)
305 except (TypeError, NotImplementedError):
306 if not ignore_failures:
~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in
setitem(self, indexer, value)
957 else:
958 value = setitem_datetimelike_compat(values,
len(values[indexer]), value)
--> 959 values[indexer] = value
960
961 return self
ValueError: shape mismatch: value array of shape (6954,6) could not
be broadcast to indexing result of shape (6954,4)
Next, I tried this:
from sklearn.preprocessing import OneHotEncoder
oneh = OneHotEncoder(handle_unknown='ignore')
cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
cat_test.loc[:,cat] = oneh.transform(cat_test)
and I got this dataframe, but this is not what I am looking for:
HomePlanet CryoSleep Destination VIP
2333 (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0,
0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0,
3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0, 3)\t1.0\n (0,
7)\t1.0\n ...
2589 (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0,
0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0,
3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0, 3)\t1.0\n (0,
7)\t1.0\n ...
I also used ColumnTransformer, but it does not give me back the DataFrame I want (I mean a DataFrame with the original columns used before the OneHotEncoder; see cat_train above). These are the steps I took:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
transformers=[("OneHotEncoder", OneHotEncoder(drop='first',
sparse=False), cat)],
remainder='passthrough'
)
cat_train = ct.fit_transform(cat_train)
cat_test = ct.transform(cat_test)
cat_train = pd.DataFrame(cat_train, columns=ct.get_feature_names())
cat_test = pd.DataFrame(cat_test, columns=ct.get_feature_names())
cat_train
and the cat_train.head() I got is :
OneHotEncoder__x0_Europa OneHotEncoder__x0_Mars OneHotEncoder__x1_True OneHotEncoder__x2_PSO J318.5-22 OneHotEncoder__x2_TRAPPIST-1e OneHotEncoder__x3_True
0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 0.0 1.0 0.0
2 1.0 0.0 1.0 0.0 0.0
This is weird, because next I need to concatenate cat_train with num_train (and likewise for the test set), and when I do this, a lot of NaN values appear, whereas I had already imputed all the NaN values before. Any idea?
CodePudding user response:
The first error occurs because you are trying to assign the one-hot-encoded data, which has more columns than the original (6 versus 4), back to the same original columns. Instead, you need to add these dummy columns and delete the original ones. Also, applying fit_transform to both train and test (assuming the repeated train row is a typo) is a bad idea: the encoder should be fitted on the training set only and then applied to the test set with transform.
The second error appears to be due to the one-hot-encoded data being sparse. You can specify sparse=False in the OneHotEncoder to fix that, but then you will probably have the same issue as above.
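As a small illustration of the sparse output, here is a sketch on a made-up one-column frame (the values are assumptions). It sidesteps the keyword entirely by calling toarray() on the result, which matters because the sparse argument was renamed sparse_output in scikit-learn 1.2:

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column standing in for one of the categorical features.
toy = pd.DataFrame({"Destination": ["TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e"]})

encoded = OneHotEncoder().fit_transform(toy)  # default output is a SciPy sparse matrix
print(sparse.issparse(encoded))

dense = encoded.toarray()  # plain NumPy array, safe to wrap in a DataFrame
frame = pd.DataFrame(dense, index=toy.index)
print(frame)
```

Assigning the sparse object directly into DataFrame columns is what produces cells such as "(0, 0)\t1.0" in the output above; converting to a dense array first avoids that.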
The best thing to do is to use a ColumnTransformer; it would handle all the concatenation for you.
CodePudding user response:
I managed to find the solution. In fact, I was trying to get back the original columns as they were before the OneHotEncoder (since I had 4 columns, I thought I should get those 4 columns back), which is generally not possible. In my case, each column of cat_train has more than one category, so the result of the OneHotEncoder must have more columns than before. Based on this, I rewrote the code as follows:
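import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
feats = df.drop(["Transported"], axis=1)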
target = df["Transported"]
X_train, X_test, y_train, y_test = train_test_split(feats, target,
test_size = 0.2, random_state=42)
Separate numeric columns from categorical columns
import numpy as np
num_train = X_train.select_dtypes(include=[np.number])
cat_train = X_train.select_dtypes(exclude=[np.number])
num_test = X_test.select_dtypes(include=[np.number])
cat_test = X_test.select_dtypes(exclude=[np.number])
Fill in missing values
num_imp = SimpleImputer(strategy='median')
num_train = num_imp.fit_transform(num_train)
num_test = num_imp.transform(num_test)
cat_imp = SimpleImputer(strategy='most_frequent')
cat_train = cat_imp.fit_transform(cat_train)
cat_test = cat_imp.transform(cat_test)
Encode categorical variables
cat_enc = OneHotEncoder(handle_unknown='ignore')
cat_train = cat_enc.fit_transform(cat_train)
cat_test = cat_enc.transform(cat_test)
And now the magic part: reconstitute the training and test sets
X_train = pd.concat([pd.DataFrame(num_train),
pd.DataFrame(cat_train.toarray())], axis=1)
X_test = pd.concat([pd.DataFrame(num_test),
pd.DataFrame(cat_test.toarray())], axis=1)
The DataFrame is now as it should be:
X_train.head()
0 1 2 3 4 5 0 1 2 3 4 5 6 7 8 9
0 28.0 0.0 55.0 0.0 656.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
1 17.0 0.0 1195.0 31.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
2 28.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
3 20.0 0.0 2.0 289.0 976.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
4 36.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
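A note on the NaN values from the earlier ColumnTransformer attempt: they most likely came from index alignment rather than from the imputation. pd.concat with axis=1 matches rows by index label, so a frame rebuilt from a bare array (index 0..n-1) does not line up with a frame that still carries the shuffled index from train_test_split. The sketch below reproduces this on made-up values (all names and numbers are assumptions):

```python
import numpy as np
import pandas as pd

# Numeric part that still carries the shuffled index from train_test_split.
num_part = pd.DataFrame({"Age": [28.0, 17.0, 28.0]}, index=[2333, 2589, 8302])

# Encoded part rebuilt from a bare array: pandas gives it a fresh 0..n-1 index.
encoded = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
names = ["HomePlanet_Earth", "HomePlanet_Europa"]
cat_part_default = pd.DataFrame(encoded, columns=names)

# Labels do not match, so concat takes the union of both indexes and fills with NaN.
misaligned = pd.concat([num_part, cat_part_default], axis=1)
print(misaligned.shape, int(misaligned.isna().sum().sum()))

# Passing the original index when rebuilding the frame keeps the rows aligned.
cat_part_indexed = pd.DataFrame(encoded, columns=names, index=num_part.index)
aligned = pd.concat([num_part, cat_part_indexed], axis=1)
print(aligned)
```

The code above avoids the problem because both pieces are rebuilt from arrays and therefore share the same default index, at the cost of losing the original row labels and column names.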