When I use LabelEncoder to label the categorical data, it spits out all the input numbers but in ra

I have a CSV file full of numbers, with 569 rows and 32 columns. The second column contains either M or B. I want an array in which every M is switched to 1 and every B to 0, with the values in the same order as the rows, top to bottom.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

gives me

array([506, 375, 361, 533, 216, 516, 328, 498, 485, 534, 330, 473, 462,
        30, 530, 524, 306, 497, 202, 154, 300, 234, 448, 185, 426, 478,
       520, 175, 444, 258, 439, 526, 336, 495, 513, 352, 458,  24,   0,
       484, 136, 429, 471, 465, 431, 396, 173, 504, 319, 100,  53, 276,
       172, 266, 252, 119, 318, 491,  31, 383, 273, 224, 496, 339, 467,
       376, 402,  65, 502,  37,  56, 489, 523, 466, 198,  36, 141, 492,
       450, 257, 371, 459, 476, 398, 260, 350,  92, 411, 409, 335, 151,
        89,  27,  60, 309, 328, 167, 380, 360, 416, 170, 418,  94, 339,
       187, 528, 390, 139, 440, 369, 333, 337, 488, 383, 460, 345, 225,
       481, 518,  19, 343, 331, 271, 270, 208, 138, 256,  50, 235, 332,
       407, 272, 204, 127, 199, 280, 164,  75, 137,  82, 293, 294, 284,
       287,  77, 470, 466, 404, 217, 115,  40, 532, 519,  79, 352, 289,
       229,  11, 255, 218, 266,  28, 406, 389, 397,  17, 222, 146, 412,
        38,  78, 166, 456, 160,  22, 247, 500, 424,   5, 161, 282, 521,
       351, 177, 438, 220, 103, 131,  54,  33, 531,  93,  52, 511, 356,
       104, 414,  51, 405, 457, 295, 251, 362, 490, 359, 437, 168,  43,
       486, 183,   6, 267,   2,  86, 464, 478, 327, 242, 312, 188, 357,
       297, 365, 480, 205,  16, 313, 326, 428, 515, 386, 130, 159, 324,
       298, 203, 355, 135, 238, 341,  47,  81, 522,  34, 201, 231, 142,
       503, 292, 242, 248,  45, 522, 285, 377, 264, 453, 507, 461, 510,
       268,  48, 184,  90, 190, 307, 192, 117, 126, 356,  20, 274, 372,
       296, 262,  14,  35,   4,  29,  98, 436,  68, 237, 479, 135,  39,
       452,  99, 113, 111, 366, 334, 427, 112, 101,  84,  66,  99, 215,
        80, 446, 233, 422, 246, 210,  74, 329, 240,  26,  55, 108,   3,
       311, 180, 286,  25,  15, 303, 477,   9, 435,  10, 347, 463, 265,
        95, 129, 120, 304, 263, 391, 394,  49, 174, 143, 195, 455, 442,
       213, 229, 364, 441, 388, 257, 241, 338, 283, 301, 363, 193,  87,
       475, 367, 419, 116, 140, 322, 132, 179, 291,   1,  67, 149,  43,
       191,  70, 209, 182,  46, 349, 430,  76, 354, 124, 223, 378, 509,
       125, 432, 527, 403, 158, 415, 494, 162,  91, 368,  62, 472, 196,
       225, 373, 461, 454, 128,  61, 219, 123, 393, 288, 508, 155, 152,
       245,  12, 227, 114, 293, 342,  96, 230, 254, 399, 408,  23, 165,
       320, 473, 421, 134, 317, 400, 177, 370, 271, 279, 433, 212, 129,
        13, 499, 417, 281, 321,  88, 477, 228, 148,  97,  69, 425, 261,
        85,  71, 308, 310, 387, 157, 181, 176, 449,  18, 302, 243, 163,
       214, 145,  83,  32, 144, 385, 177,  42, 250, 102, 517, 295, 253,
       512, 410, 344,  64, 314,  73, 487, 239, 249, 221, 395, 392, 226,
       197, 413, 109, 299, 469,  58, 382, 275, 205, 305,   7, 206, 118,
       178,  59, 468, 211, 420, 381, 346, 505, 186, 156, 518, 525, 370,
       501, 147, 483, 445,  21, 493, 122, 106,  81, 250, 392, 374, 348,
       379, 434, 200, 384, 401, 474, 353, 194, 153, 278, 232, 377, 236,
        35, 315, 189, 325, 451, 447, 482, 290, 462, 107,  41, 340, 105,
       171, 423, 259, 207,  57, 277,  44, 169, 150, 316,  72, 110, 269,
       358, 323,   8, 529, 443, 133,  63, 244, 514, 121], dtype=int64)

but the output should be

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
      dtype=int64)

Any ideas? :/

my full code is

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset (raw string so the backslashes in the path are not treated as escapes)
dataset = pd.read_csv(r'C:\Machine learning\cancer_data\cancer.csv')
X = dataset.iloc[:, 1:31].values
Y = dataset.iloc[:, 31].values

dataset.head()

print("Cancer data set dimensions : {}".format(dataset.shape))

dataset.groupby('diagnosis').size()

#Visualization of data
dataset.groupby('diagnosis').hist(figsize=(12, 12))

dataset.isnull().sum()
dataset.isna().sum()

dataframe = pd.DataFrame(Y)
#Encoding categorical data values 
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)


# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)


#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

CodePudding user response：

The problem you describe and the code you posted as an attempt to solve it don't match. Your problem is about categorical encoding, but the snippet is about scaling numerical features. Is it a column in X that you are trying to encode, or are you talking about Y?

Going by the information you provided, there is a second column with labels 'M' and 'B' that you want to encode as 1 and 0 respectively. To achieve that, you could try:

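# second column of some DataFrame, looked up by position (categories sort as B, M, so B -> 0 and M -> 1)
col = dataset.columns[1]
dataset[col] = dataset[col].astype('category').cat.codes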
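
If the exact assignment matters (M to 1, B to 0), an explicit mapping spells it out instead of relying on the alphabetical order of the categories. This is only a minimal sketch: the small frame below is made up for illustration, and it assumes the M/B column is called diagnosis, as in the groupby call in the question.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: an id column, then the M/B column.
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "diagnosis": ["M", "B", "B", "M"],
})

# Explicit mapping: M -> 1, B -> 0. Row order is preserved.
encoded = df["diagnosis"].map({"M": 1, "B": 0})

# cat.codes agrees here, because the categories sort as B, M.
codes = df["diagnosis"].astype("category").cat.codes

print(encoded.tolist())
print(codes.tolist())
```

With `map`, any value that is not in the dictionary becomes NaN, which can help spot unexpected labels.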

If you actually meant Y from the beginning, then check what Y contains, because your encoding code itself is probably right.
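
One way to check: the output in the question contains hundreds of distinct values, which is what LabelEncoder returns when it is given a numeric column (each distinct number gets its own label, in sorted order). In the posted code, Y is taken from iloc[:, 31], the last column, while the M/B values are described as being in the second column, so that may be the cause. A small sketch with made-up data follows; the feature column names and all values are hypothetical, and dropping id from X is my assumption, not something the question states.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up miniature of the CSV layout: id, diagnosis, then numeric features.
dataset = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": ["M", "B", "M", "B"],
    "feature_a": [17.9, 12.4, 20.5, 11.2],
    "feature_b": [0.46, 0.27, 0.39, 0.31],
})

# Encoding the last (numeric) column just ranks its distinct values.
wrong = LabelEncoder().fit_transform(dataset.iloc[:, -1].to_numpy())

# Encoding the diagnosis column gives B -> 0, M -> 1, in row order.
encoder = LabelEncoder()
Y = encoder.fit_transform(dataset["diagnosis"].to_numpy())

# Features: everything except the id and the label column (an assumption).
X = dataset.drop(columns=["id", "diagnosis"]).to_numpy()

print(wrong)
print(Y, list(encoder.classes_))
```

Printing `encoder.classes_` on the real data is a quick sanity check: it should list only B and M if the right column was selected.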