Python Dataframe column name issues with StrEnum and sklearn regression

Time:07-20

I have been seeing many warnings from sklearn, for example:

FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['MyNewNames']. An error will be raised in 1.2.

These appear when I use StrEnum members as the names of my features in regression analyses. I wrote the following code to illustrate the case:

from enum import auto
from strenum import StrEnum
import pandas as pd
from sklearn import datasets, linear_model


class MyNewNames(StrEnum):
   MyCrim = auto()
   MyZN =  auto()
   MyIndus = auto()
        
class ComputeRegression():
  
   def SetUP(self,df:pd.DataFrame):
      variables = [[MyNewNames.MyCrim,MyNewNames.MyZN],[MyNewNames.MyCrim,MyNewNames.MyIndus]]

      for value in variables:
        dX = df[value]    
        dY = df["AGE"]
        self.ComputeRegression(dX,dY)
        
   def ComputeRegression(self,dX,dY):
        
      model = linear_model.LinearRegression()
      model.fit(X = dX, y = dY)
      predicted = model.predict(dX) # I see warnings when calling this line
      print(predicted) 

boston = datasets.load_boston()

df = pd.DataFrame(data= boston['data'], columns= boston['feature_names'])
df[MyNewNames.MyCrim] = df["CRIM"]
df[MyNewNames.MyZN] = df["ZN"]
df[MyNewNames.MyIndus] = df["INDUS"]

cr = ComputeRegression()
cr.SetUP(df)

The issue seems to arise when I create a new column in the dataframe using the StrEnum member itself, because when I change that part of the code to:

df[MyNewNames.MyCrim.value] = df["CRIM"]
df[MyNewNames.MyZN.value] = df["ZN"]
df[MyNewNames.MyIndus.value] = df["INDUS"]

the warnings disappear. Can anyone explain why creating a new dataframe column with an enum member such as MyNewNames.Something causes issues with the naming of the df column, while accessing a column with df[MyNewNames.MyIndus] is not a problem?

Thanks!

CodePudding user response:

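As of today, from my point of view, the restriction comes from sklearn rather than pandas: pandas accepts any hashable object as a column label, while sklearn expects every feature name to be a native Python string. That is a design choice by the sklearn authors.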

When you pass MyNewNames.Something as the column name, the label stored in the dataframe is the enum member itself, whose type is MyNewNames rather than str.
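To make the difference concrete, here is a minimal sketch that uses only the standard library. It assumes a str-mixin Enum with explicit values as a stand-in for the StrEnum in the question, and a plain dict as a stand-in for the dataframe's column index. Neither is how pandas or sklearn is actually implemented.

```python
from enum import Enum


class MyNewNames(str, Enum):
    # Stand-in for the StrEnum in the question; values are written out
    # explicitly so the sketch does not rely on how auto() names members.
    MyCrim = "MyCrim"
    MyZN = "MyZN"


member = MyNewNames.MyCrim

# The member's concrete type is the enum class, not the built-in str.
member_type_name = type(member).__qualname__
value_type_name = type(member.value).__qualname__

# It still compares and hashes like the underlying string, which is why
# looking up an existing string label with the member succeeds.
same_as_string = member == "MyCrim"
same_hash = hash(member) == hash("MyCrim")

# A plain dict stands in for the column index here (an assumption made
# only for illustration; pandas uses its own Index type).
columns = {"MyCrim": "column data"}
found = columns[member]

print(member_type_name, value_type_name, same_as_string, same_hash, found)
```

This would explain why df[MyNewNames.MyIndus] finds a column that was created with the plain string, while a column created with the member keeps MyNewNames as the type of its label.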

CodePudding user response:

The likely cause is that sklearn does an exact type check on column names instead of an isinstance check. That is:

if type(column_name) is str

instead of

if isinstance(column_name, str)
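The two checks give different answers for an enum member that subclasses str. The sketch below compares them; exact_type_check and isinstance_check are hypothetical helpers written for illustration, not sklearn's actual code, and the str-mixin Enum is an assumed stand-in for the StrEnum in the question.

```python
from enum import Enum


class MyNewNames(str, Enum):
    # Assumed stand-in for the StrEnum from the question.
    MyCrim = "MyCrim"


def exact_type_check(name):
    # True only for the built-in str itself; subclasses are rejected.
    return type(name) is str


def isinstance_check(name):
    # True for str and for any subclass of str.
    return isinstance(name, str)


# A plain string, an enum member, and the member's underlying value.
labels = ["CRIM", MyNewNames.MyCrim, MyNewNames.MyCrim.value]
results = [(exact_type_check(x), isinstance_check(x)) for x in labels]

print(results)  # [(True, True), (False, True), (True, True)]
```

Only the enum member fails the exact type check, which matches the observation in the question that storing MyNewNames.MyCrim.value as the column name makes the warning go away.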