I am trying to do cross-validation (CV) on my training and testing datasets, using LinearRegression. When I run the code, I get the error below. But when I run the same CV code with a decision tree, it works without errors. How do I fix this? Is my code for the CV section correct? Thank you for your help.
Reference for the CV code: scikit-learn cross_validation over-fitting or under-fitting
import pandas as pd
from sklearn import preprocessing

data_set = pd.read_excel("NEW Collected Data for Preliminary Results Independant variables ONLY_NO AREA_NO_INFILL_DENSITY_no_printing_temperature.xlsx")
pd.set_option('max_columns', 35)
pd.set_option('max_rows', 300)
data_set.head(300)
X, y = data_set[[ "Part's Z-Height (mm)","Part's Solid Volume (cm^3)","Layer Height (mm)","Printing/Scanning Speed (mm/s)","Part's Orientation (Support's volume) (cm^3)"]], data_set [["Climate change (kg CO2 eq.)","Climate change, incl biogenic carbon (kg CO2 eq.)","Fine Particulate Matter Formation (kg PM2.5 eq.)","Fossil depletion (kg oil eq.)","Freshwater Consumption (m^3)","Freshwater ecotoxicity (kg 1,4-DB eq.)","Freshwater Eutrophication (kg P eq.)","Human toxicity, cancer (kg 1,4-DB eq.)","Human toxicity, non-cancer (kg 1,4-DB eq.)","Ionizing Radiation (Bq. C-60 eq. to air)","Land use (Annual crop eq. yr)","Marine ecotoxicity (kg 1,4-DB eq.)","Marine Eutrophication (kg N eq.)","Metal depletion (kg Cu eq.)","Photochemical Ozone Formation, Ecosystem (kg NOx eq.)","Photochemical Ozone Formation, Human Health (kg NOx eq.)","Stratospheric Ozone Depletion (kg CFC-11 eq.)","Terrestrial Acidification (kg SO2 eq.)","Terrestrial ecotoxicity (kg 1,4-DB eq.)"]]
scaler = preprocessing.MinMaxScaler()
names = data_set.columns
d = scaler.fit_transform(data_set)
scaled_df = pd.DataFrame(d, columns=names)
X_normalized, y_for_normalized = scaled_df[[ "Part's Z-Height (mm)","Part's Solid Volume (cm^3)","Layer Height (mm)","Printing/Scanning Speed (mm/s)","Part's Orientation (Support's volume) (cm^3)"]], scaled_df [["Climate change (kg CO2 eq.)","Climate change, incl biogenic carbon (kg CO2 eq.)","Fine Particulate Matter Formation (kg PM2.5 eq.)","Fossil depletion (kg oil eq.)","Freshwater Consumption (m^3)","Freshwater ecotoxicity (kg 1,4-DB eq.)","Freshwater Eutrophication (kg P eq.)","Human toxicity, cancer (kg 1,4-DB eq.)","Human toxicity, non-cancer (kg 1,4-DB eq.)","Ionizing Radiation (Bq. C-60 eq. to air)","Land use (Annual crop eq. yr)","Marine ecotoxicity (kg 1,4-DB eq.)","Marine Eutrophication (kg N eq.)","Metal depletion (kg Cu eq.)","Photochemical Ozone Formation, Ecosystem (kg NOx eq.)","Photochemical Ozone Formation, Human Health (kg NOx eq.)","Stratospheric Ozone Depletion (kg CFC-11 eq.)","Terrestrial Acidification (kg SO2 eq.)","Terrestrial ecotoxicity (kg 1,4-DB eq.)"]]
scaled_df.head(200)
(Output of scaled_df.head(200): a 24-column table, the five feature columns plus the nineteen impact-category targets, with every value min-max scaled into [0, 1]; the wide table is not reproduced here.)
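To sanity-check the scaling step: MinMaxScaler rescales each column independently to [0, 1] via (x - min) / (max - min). A minimal sketch with a toy array (the values here are illustrative, not from the dataset):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy 2-column array, purely illustrative
toy = np.array([[1.0, 10.0],
                [2.0, 30.0],
                [4.0, 20.0]])

scaled = MinMaxScaler().fit_transform(toy)

# each column is mapped via (x - col_min) / (col_max - col_min)
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))
assert np.allclose(scaled, manual)
print(scaled)  # column 1: [0.0, 1/3, 1.0]; column 2: [0.0, 1.0, 0.5]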
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

lin_regressor = LinearRegression()
# pass the order of your polynomial here
poly = PolynomialFeatures(1)
# convert the features so they can be passed to the linear regression
X_transform = poly.fit_transform(x_train)  # x_train, y_train come from an earlier split (not shown)
# fit this to the Linear Regressor
linear_regg = lin_regressor.fit(X_transform, y_train)
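As an aside, one way to avoid juggling the transformed array separately (the source of the indexing trouble below) is to chain both steps with scikit-learn's make_pipeline, so the polynomial expansion happens inside fit. A minimal sketch, reusing the X_normalized and y_for_normalized defined above; this is not the original code:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# PolynomialFeatures runs inside the pipeline, so you only ever
# index the original DataFrame, never the transformed ndarray
poly_model = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())
poly_model.fit(X_normalized, y_for_normalized)
print(poly_model.score(X_normalized, y_for_normalized))  # R^2 on the training data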
import numpy as np
from sklearn.metrics import SCORERS
from sklearn.model_selection import KFold

scorer = SCORERS['r2']
cv = KFold(n_splits=5, random_state=0, shuffle=True)
train_scores, test_scores = [], []
for train, test in cv.split(X_normalized):
    X_transform2 = poly.fit_transform(X_normalized)
    OL = lin_regressor.fit(X_transform2.iloc[train], y_for_normalized.iloc[train])
    tr_21 = OL.score(X_train, y_train)
    ts_21 = OL.score(X_test, y_test)
    print("Train score:", tr_21)  # from documentation .score returns r^2
    print("Test score:", ts_21)  # from documentation .score returns r^2
    train_scores.append(tr_21)
    test_scores.append(ts_21)
print("The Mean for Train scores is:", np.mean(train_scores))
print("The Mean for Test scores is:", np.mean(test_scores))
Error message:
--------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/mm/r4gnnwl948zclfyx12w803040000gn/T/ipykernel_73165/2276765730.py in <module>
10 for train, test in cv.split(X_normalized):
11 X_transform2 = poly.fit_transform(X_normalized)
---> 12 OL=lin_regressor.fit(X_transform2.iloc[train], y_for_normalized.iloc[train])
13 tr_21 = OL.score(X_train, y_train)
14 ts_21 = OL.score(X_test, y_test)
AttributeError: 'numpy.ndarray' object has no attribute 'iloc'
Decision Trees
from sklearn.tree import DecisionTreeRegressor

new_model = DecisionTreeRegressor(max_depth=9, min_samples_split=10, random_state=0)

import numpy as np
from sklearn.metrics import SCORERS
from sklearn.model_selection import KFold

scorer = SCORERS['r2']
cv = KFold(n_splits=5, random_state=0, shuffle=True)
train_scores, test_scores = [], []
for train, test in cv.split(X_normalized):
    OO = new_model.fit(X_normalized.iloc[train], y_for_normalized.iloc[train])
    tr_2 = OO.score(X_train, y_train)
    ts_2 = OO.score(X_test, y_test)
    print("Train score:", tr_2)  # from documentation .score returns r^2
    print("Test score:", ts_2)  # from documentation .score returns r^2
    train_scores.append(tr_2)
    test_scores.append(ts_2)
print("The Mean for Train scores is:", np.mean(train_scores))
print("The Mean for Test scores is:", np.mean(test_scores))
Output
Train score: 0.8960560474997927
Test score: -0.15521696464773224
Train score: 0.8852795454592853
Test score: 0.17650772852710495
Train score: 0.5825347735306872
Test score: 0.34789159049344665
Train score: 0.8549575808716975
Test score: 0.7615265842042157
Train score: 0.8340261480334055
Test score: 0.14011826401728472
The Mean for Train scores is: 0.8105708190789735
The Mean for Test scores is: 0.2541654405188639
# Trial 1
import numpy as np
from sklearn.metrics import SCORERS
from sklearn.model_selection import KFold

scorer = SCORERS['r2']
cv = KFold(n_splits=5, random_state=0, shuffle=True)
train_scores, test_scores = [], []
for train, test in cv.split(X_normalized):
    X_transform2 = poly.fit_transform(X_normalized)
    OL = lin_regressor.fit(X_transform2[train], y_for_normalized[train])
    tr_21 = OL.score(X_train, y_train)
    ts_21 = OL.score(X_test, y_test)
    print("Train score:", tr_21)  # from documentation .score returns r^2
    print("Test score:", ts_21)  # from documentation .score returns r^2
    train_scores.append(tr_21)
    test_scores.append(ts_21)
print("The Mean for Train scores is:", np.mean(train_scores))
print("The Mean for Test scores is:", np.mean(test_scores))
Error message:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/var/folders/mm/r4gnnwl948zclfyx12w803040000gn/T/ipykernel_90924/12176184.py in <module>
10 for train, test in cv.split(X_normalized):
11 X_transform2 = poly.fit_transform(X_normalized)
---> 12 OL=lin_regressor.fit(X_transform2[train], y_for_normalized[train])
13 tr_21 = OL.score(X_train, y_train)
14 ts_21 = OL.score(X_test, y_test)
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
3462 if is_iterator(key):
3463 key = list(key)
-> 3464 indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
3465
3466 # take() does not accept boolean indexers
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1372 if use_interval_msg:
1373 key = list(key)
-> 1374 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
KeyError: "None of [Int64Index([ 0, 1, 3, 4, 5, 6, 9, 10, 11, 12, 14, 15, 17, 18, 19, 20, 21,\n 23, 25, 27, 28, 29, 31, 32, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,\n 44, 45, 46, 47, 48, 49, 50, 51, 52, 56, 57, 58, 59, 60, 61, 62, 63,\n 64, 65, 66, 67, 68, 69, 70, 71, 72, 74, 76, 77, 79, 80, 81, 82, 83,\n 84, 85, 87, 88, 89, 90, 91, 94, 96, 97, 98, 99],\n dtype='int64')] are in the [columns]"
Answer:
Understanding

- `poly.fit_transform` returns a `numpy.ndarray`, so here your `X_normalized` is being transformed from a `pandas.core.frame.DataFrame` into a `numpy.ndarray`.
- But your `y_for_normalized` is still a `pandas.core.frame.DataFrame`.
- So for a `numpy.ndarray` you pass indexes as `array[indexes]`, while for a `pandas.core.frame.DataFrame` you pass indexes via `.iloc[indexes]` (see the short demo below).
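To see the difference concretely, here is a tiny demo with toy data (not the question's dataset) of positional row indexing on a `numpy.ndarray` versus a `pandas.DataFrame`:

import numpy as np
import pandas as pd

arr = np.arange(8).reshape(4, 2)            # a plain ndarray
df = pd.DataFrame(arr, columns=["a", "b"])  # the same data as a DataFrame
idx = np.array([0, 2])                      # row positions, like KFold's train/test indexes

print(arr[idx])      # ndarray: [] selects rows by position -> rows 0 and 2
print(df.iloc[idx])  # DataFrame: .iloc[] selects rows by position -> rows 0 and 2
# arr.iloc[idx] would raise AttributeError ('numpy.ndarray' object has no attribute 'iloc'),
# and df[idx] would raise KeyError, because [] on a DataFrame selects columns by label:
# exactly the two errors shown in the question.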
Solution

- For `X_transform2`, use `[]` to get the data, as it's a `numpy.ndarray`.
- For `y_for_normalized`, use `.iloc[]`, as it's a `pandas.core.frame.DataFrame`.
Code
train_scores, test_scores = [], []
for train, test in cv.split(X_normalized):
    X_transform2 = poly.fit_transform(X_normalized)
    # [] for X_transform2 (ndarray), .iloc[] for y_for_normalized (DataFrame)
    OL = lin_regressor.fit(X_transform2[train], y_for_normalized.iloc[train])
    tr_21 = OL.score(X_transform2[train], y_for_normalized.iloc[train])
    ts_21 = OL.score(X_transform2[test], y_for_normalized.iloc[test])
    print("Train score:", tr_21)  # from documentation .score returns r^2
    print("Test score:", ts_21)  # from documentation .score returns r^2
    train_scores.append(tr_21)
    test_scores.append(ts_21)
print("The Mean for Train scores is:", np.mean(train_scores))
print("The Mean for Test scores is:", np.mean(test_scores))
PS:

- I don't know why you are using `X_train`, `y_train` and `X_test`, `y_test` in `OL.score`. It should be the data at the `train` and `test` indexes generated by `cv`, which is what the code snippet above does.
  - If you have `X_train`, `y_train` and `X_test`, `y_test` defined for a specific reason, then you are good to use them.
- Why are you using `PolynomialFeatures()` when you want all your features to be of degree 1? They already are, so `PolynomialFeatures()` with degree 1 makes no difference.
- Also check for a deprecation warning for `SCORERS` if you are using a newer version of `sklearn`. (A loop-free alternative is sketched below.)
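For what it's worth, scikit-learn can run this whole loop for you. A minimal sketch using cross_validate with the same 5-fold setup; bundling the degree-1 PolynomialFeatures and the regressor into a pipeline is my assumption here, not the original code:

import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

cv = KFold(n_splits=5, random_state=0, shuffle=True)
model = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())

# scoring="r2" replaces the SCORERS['r2'] lookup; return_train_score=True
# also records the per-fold train scores, matching the manual loop above
results = cross_validate(model, X_normalized, y_for_normalized,
                         cv=cv, scoring="r2", return_train_score=True)
print("The Mean for Train scores is:", np.mean(results["train_score"]))
print("The Mean for Test scores is:", np.mean(results["test_score"]))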