Pandas column transform into 1D array

Time:08-24

I'm trying to preprocess a DataFrame before building a decision tree, but I'm getting an error about the dimensions:

y should be a 1d array, got an array of shape (1, 1460) instead.

I've tried to use values = df_train[col].unique().flatten() but the error was the same.

The code is the following:

for col in df_train.columns:
    values = df_train[col].unique()
    new_col = preprocessing.LabelEncoder()
    new_col.fit([values])
    col_num = df_train.columns.get_loc(col)
    df_train[:,col_num] = new_col.transform(df_train[:,col_num]) 

Example of columns:

Colour Area
Red 230
Yellow 400

Thank you!

CodePudding user response:

Here's your code fixed (pass 'values' without an outside list, also use .iloc for integer indexing):

import pandas as pd
from sklearn import preprocessing

df_train = pd.DataFrame({'A': ['cat', 'dog', 'cat', 'cactus'],
                  'B': ['gray', 'black', 'black', 'green']})
print(df_train)

        A      B
0     cat   gray
1     dog  black
2     cat  black
3  cactus  green

for col in df_train.columns:
    values = df_train[col].unique()
    new_col = preprocessing.LabelEncoder()
    new_col.fit(values)
    col_num = df_train.columns.get_loc(col)
    df_train.iloc[:,col_num] = new_col.transform(df_train.iloc[:,col_num])
    
print(df_train)

   A  B
0  1  2
1  2  0
2  1  0
3  0  1

But this is way too complicated. It's better to use OrdinalEncoder:

Proper way

ord_enc = preprocessing.OrdinalEncoder()
X_train = ord_enc.fit_transform(df_train)
print(X_train)

[[1. 2.]
 [2. 0.]
 [1. 0.]
 [0. 1.]]

OrdinalEncoder is designed for encoding features (a 2D X with one column per feature), unlike LabelEncoder, which is for encoding the target (a 1D y).

CodePudding user response:

I can't say for sure because you haven't shared the full output of the code, but I think it would help to take the transpose of the result of df_train[col].unique(), which would convert it from shape (1, 1460) to (1460, 1). My guess is that 1460 is your number of samples and 1 is the number of columns.
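
A quick way to check the shapes involved is to print them. This is only a sketch on made-up toy data (the column name and values are assumptions based on the example in the question). It shows that wrapping the unique values in a list is what adds the extra leading dimension, that a transpose gives a column rather than a 1D array, and that ravel() gives the 1D shape the error message asks for:

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration
df_train = pd.DataFrame({'Colour': ['Red', 'Yellow', 'Red']})
values = df_train['Colour'].unique()

# fit([values]) effectively receives this 2D array with a single row
wrapped = np.asarray([values])

print(values.shape)            # (2,)   already 1D
print(values.flatten().shape)  # (2,)   flatten() changes nothing here
print(wrapped.shape)           # (1, 2) the extra list adds a dimension
print(wrapped.T.shape)         # (2, 1) transpose: a column, still 2D
print(wrapped.ravel().shape)   # (2,)   back to 1D
```

This would also explain why adding .flatten() made no difference: the array was 1D before it was wrapped in the list.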
