"IndexError: tuple index out of range" on train_test_split train data once attempting to f-CodePudding

I was trying to pre-process my data using normalization.

# preprocessing
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow.keras import layers
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

np.set_printoptions(precision=3, suppress=True)
btc_data = pd.read_csv(
    "output.csv",
    names=["Time", "Open"])

ct = make_column_transformer(
    (MinMaxScaler(), ["Time", "Open"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Time", "Open"])
)

X_btc = btc_data["Time"]
y_btc = btc_data["Open"]

X_train, X_test, y_train, y_test = train_test_split(X_btc, y_btc, test_size=0.2, random_state=62)

ct.fit(X_train)
X_train_normal = ct.transform(X_train)
X_test_normal = ct.transform(X_test)

The code runs in a Colab notebook. The dataset comes from Kaggle and has been modified to contain one column of Unix timestamps and another column with the Bitcoin opening price at each of those times. After splitting the data and creating a column transformer, I tried to fit the transformer on the training data. However, I get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-44-f73622372111> in <module>()
     27 print(X_train.shape)
     28 
---> 29 ct.fit(X_train)
     30 X_train_normal = ct.transform(X_train)
     31 X_test_normal = ct.transform(X_test)

3 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
    387     :func:`_safe_indexing_column`.
    388     """
--> 389     n_columns = X.shape[1]
    390 
    391     key_dtype = _determine_key_type(key)

IndexError: tuple index out of range

I suspect it is a shape issue. For reference, X_train has shape (2020896,).

Is there something I have to do with my data to fix this error?

Answer:

You extracted X_btc as a pandas Series, which is one-dimensional, but the column transformer expects a DataFrame (a 2D array/matrix). Replace:

X_btc = btc_data["Time"]

with:

X_btc = btc_data[["Time"]]

so that X_btc is a DataFrame with a single column.
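To make the difference concrete, here is a minimal sketch comparing the two selections. The three rows are made-up values standing in for the real output.csv, and the names series_x, frame_x and error_message are only for illustration. The try block mirrors the `X.shape[1]` line from the traceback.

```python
import pandas as pd

# Hypothetical stand-in for btc_data; the values are invented for illustration.
btc_data = pd.DataFrame({
    "Time": [1609459200, 1609459260, 1609459320],
    "Open": [28923.6, 28930.1, 28941.0],
})

series_x = btc_data["Time"]    # single brackets: pandas Series, 1D
frame_x = btc_data[["Time"]]   # double brackets: pandas DataFrame, 2D

print(series_x.shape)  # (3,)
print(frame_x.shape)   # (3, 1)

# A 1D shape is a one-element tuple, so asking for index 1 fails,
# which is the same failure as the X.shape[1] line in the traceback.
try:
    n_columns = series_x.shape[1]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)  # tuple index out of range

# The 2D DataFrame has a second dimension, so the lookup works.
n_columns = frame_x.shape[1]
print(n_columns)  # 1
```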

Edit for the new error:

The KeyError is caused by this transformer:

ct = make_column_transformer(
    (MinMaxScaler(), ["Time", "Open"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Time", "Open"])
)

The transformer references the columns ["Time", "Open"], but X_btc has no "Open" column because you selected only "Time". "Open" is the target label (y_btc) and should not be part of X_btc, so remove "Open" from make_column_transformer:

ct = make_column_transformer(
    (MinMaxScaler(), ["Time"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Time"])
)
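Putting both fixes together, a rough end-to-end sketch might look like the following. It uses ten made-up rows instead of output.csv. It also keeps only the MinMaxScaler step, which is my own simplification rather than part of the fix above: one-hot encoding a timestamp column would produce one output column per distinct timestamp, so add the OneHotEncoder back if you really need it.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for output.csv; the values are invented for illustration.
btc_data = pd.DataFrame({
    "Time": [1609459200 + 60 * i for i in range(10)],
    "Open": [29000.0 + 5.0 * i for i in range(10)],
})

X_btc = btc_data[["Time"]]   # DataFrame (2D): features only
y_btc = btc_data["Open"]     # Series (1D): target, kept out of the transformer

X_train, X_test, y_train, y_test = train_test_split(
    X_btc, y_btc, test_size=0.2, random_state=62)

# Only columns that exist in X_btc may be listed here.
# Assumption: the OneHotEncoder step is left out in this sketch.
ct = make_column_transformer((MinMaxScaler(), ["Time"]))

# Fit on the training split only, then apply the same scaling to both splits.
ct.fit(X_train)
X_train_normal = ct.transform(X_train)
X_test_normal = ct.transform(X_test)

print(X_train_normal.shape)  # (8, 1)
print(X_test_normal.shape)   # (2, 1)
```

Because the scaler is fitted on the training split alone, the scaled training values span 0 to 1, while test values can fall slightly outside that range.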