Why can't I run this logistic regression script with a 1d array in Python?

I'm trying to figure out which variables affect the toAnalyse variable. For this I use scikit-learn's LogisticRegression. When I run the code below, I get an error.

Code:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib import rcParams
from sklearn.linear_model import LogisticRegression

rcParams['figure.figsize'] = 14, 7
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False

data = pd.read_csv('file.txt', sep=",")

df = pd.concat([
    pd.DataFrame(data, columns=data.columns),
    pd.DataFrame(data, columns=['toAnalyse'])
], axis=1)

X = df.drop(['notimportant', 'test', 'toAnalyse'], axis=1)
y = df['toAnalyse']
#y.drop(y.columns[0], axis=1, inplace=True)   <----------------- From 2 to 0 variables when running this?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

The error:

ValueError: y should be a 1d array, got an array of shape (258631, 2) instead.

That seems to be correct, because when I print y.info() I get back:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344842 entries, 0 to 344841
Data columns (total 2 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   toAnalyse          343480 non-null  float64
 1   toAnalyse          343480 non-null  float64

The toAnalyse variable thus appears to be in y twice. Okay, so I want to remove the first one (based on the index) so that I am left with a 1d array. However, when I use y.drop(y.columns[0], axis=1, inplace=True), the error tells me there are no columns left at all:

ValueError: y should be a 1d array, got an array of shape (258631, 0) instead.

What's going on, and how can I run this with a 1d array?

User response:

It looks like after

df = pd.concat([
    pd.DataFrame(data, columns=data.columns),
    pd.DataFrame(data, columns=['toAnalyse'])
], axis=1)

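you have the column 'toAnalyse' in your dataframe twice, so df['toAnalyse'] returns a two-column DataFrame instead of a Series. This is the reason for the wrong shape of y in the first place. Because drop matches columns by label rather than by position, y.drop(y.columns[0], axis=1) removes every column named 'toAnalyse', and you end up with no columns after your drop statement.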
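The duplicate-label behaviour can be reproduced on a tiny frame. This is only a sketch: the three-row data and the 'feature' column are made up as a stand-in for file.txt, and only the concat and drop calls mirror the question.

```python
import pandas as pd

# Hypothetical stand-in for the contents of file.txt
data = pd.DataFrame({
    "feature": [0.1, 0.2, 0.3],
    "toAnalyse": [0.0, 1.0, 0.0],
})

# Same concat as in the question: 'toAnalyse' ends up in df twice
df = pd.concat([
    pd.DataFrame(data, columns=data.columns),
    pd.DataFrame(data, columns=["toAnalyse"]),
], axis=1)

# Selecting a duplicated label returns a DataFrame, not a Series
y = df["toAnalyse"]
y_shape = y.shape

# drop matches by label, so both 'toAnalyse' columns are removed
dropped_shape = y.drop(y.columns[0], axis=1).shape

# Selecting by position keeps exactly one column as a 1d Series
y_1d = y.iloc[:, 0]

# Alternatively, keep only the first occurrence of each column label
df_unique = df.loc[:, ~df.columns.duplicated()]

print(y_shape, dropped_shape, y_1d.shape, list(df_unique.columns))
```

If the duplicated frame has to be kept for some reason, positional selection with iloc or filtering with columns.duplicated() are two ways to get back to a single target column.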

To fix that, I would simply remove the statement that builds df. The original data frame already seems to contain everything you need, so

X = data.drop(['notimportant', 'test', 'toAnalyse'], axis=1)
y = data['toAnalyse']

should work.
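For completeness, here is a rough end-to-end sketch of how the fixed selection could feed into the model. The data is synthetic, the names feature_a and feature_b are invented for illustration, and the fit and coefficient steps are an assumption about what the question was aiming for, since the posted code stops after scaling.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for file.txt; feature_a and feature_b are hypothetical names
rng = np.random.default_rng(42)
n = 200
data = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "notimportant": rng.normal(size=n),
    "test": rng.normal(size=n),
})
# Binary target that depends on feature_a plus some noise
data["toAnalyse"] = (data["feature_a"] + 0.5 * rng.normal(size=n) > 0).astype(float)

# Select straight from data, so 'toAnalyse' exists only once
X = data.drop(["notimportant", "test", "toAnalyse"], axis=1)
y = data["toAnalyse"]  # a Series, i.e. 1d

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# One coefficient per (scaled) feature, labelled with the column names
coefficients = pd.Series(model.coef_[0], index=X.columns)
accuracy = model.score(X_test_scaled, y_test)
print(coefficients)
print(accuracy)
```

Note that the y.info() output in the question shows fewer non-null values than rows, so the real toAnalyse column contains missing values. LogisticRegression does not accept NaN in y, so those rows would probably have to be removed first, for example with data.dropna(subset=['toAnalyse']).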