I am trying to do cross-validation (CV) on my training and testing datasets, using LinearRegression. When I run the code, I get the error below. But when I run the same CV code with a decision tree, it works without errors. How do I fix this? Is my code for the CV section correct? Thank you for your help.
Reference for the CV code: scikit-learn cross_validation over-fitting or under-fitting
import pandas as pd
from sklearn import preprocessing

data_set = pd.read_excel("NEW Collected Data for Preliminary Results Independant variables ONLY_NO AREA_NO_INFILL_DENSITY_no_printing_temperature.xlsx")
pd.set_option('max_columns', 35)
pd.set_option('max_rows', 300)
data_set.head(300)
X, y = data_set[[ "Part's Z-Height (mm)","Part's Solid Volume (cm^3)","Layer Height (mm)","Printing/Scanning Speed (mm/s)","Part's Orientation (Support's volume) (cm^3)"]], data_set [["Climate change (kg CO2 eq.)","Climate change, incl biogenic carbon (kg CO2 eq.)","Fine Particulate Matter Formation (kg PM2.5 eq.)","Fossil depletion (kg oil eq.)","Freshwater Consumption (m^3)","Freshwater ecotoxicity (kg 1,4-DB eq.)","Freshwater Eutrophication (kg P eq.)","Human toxicity, cancer (kg 1,4-DB eq.)","Human toxicity, non-cancer (kg 1,4-DB eq.)","Ionizing Radiation (Bq. C-60 eq. to air)","Land use (Annual crop eq. yr)","Marine ecotoxicity (kg 1,4-DB eq.)","Marine Eutrophication (kg N eq.)","Metal depletion (kg Cu eq.)","Photochemical Ozone Formation, Ecosystem (kg NOx eq.)","Photochemical Ozone Formation, Human Health (kg NOx eq.)","Stratospheric Ozone Depletion (kg CFC-11 eq.)","Terrestrial Acidification (kg SO2 eq.)","Terrestrial ecotoxicity (kg 1,4-DB eq.)"]]
scaler = preprocessing.MinMaxScaler()
names = data_set.columns
d = scaler.fit_transform(data_set)
scaled_df = pd.DataFrame(d, columns=names)
X_normalized, y_for_normalized = scaled_df[[ "Part's Z-Height (mm)","Part's Solid Volume (cm^3)","Layer Height (mm)","Printing/Scanning Speed (mm/s)","Part's Orientation (Support's volume) (cm^3)"]], scaled_df [["Climate change (kg CO2 eq.)","Climate change, incl biogenic carbon (kg CO2 eq.)","Fine Particulate Matter Formation (kg PM2.5 eq.)","Fossil depletion (kg oil eq.)","Freshwater Consumption (m^3)","Freshwater ecotoxicity (kg 1,4-DB eq.)","Freshwater Eutrophication (kg P eq.)","Human toxicity, cancer (kg 1,4-DB eq.)","Human toxicity, non-cancer (kg 1,4-DB eq.)","Ionizing Radiation (Bq. C-60 eq. to air)","Land use (Annual crop eq. yr)","Marine ecotoxicity (kg 1,4-DB eq.)","Marine Eutrophication (kg N eq.)","Metal depletion (kg Cu eq.)","Photochemical Ozone Formation, Ecosystem (kg NOx eq.)","Photochemical Ozone Formation, Human Health (kg NOx eq.)","Stratospheric Ozone Depletion (kg CFC-11 eq.)","Terrestrial Acidification (kg SO2 eq.)","Terrestrial ecotoxicity (kg 1,4-DB eq.)"]]
scaled_df.head(200)
(Output of scaled_df.head(200): a 24-column table, the five feature columns plus the nineteen impact-category targets, with every value min-max scaled into [0, 1]; the wide table is not reproduced here.)
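To sanity-check the scaling step: MinMaxScaler rescales each column independently to [0, 1] via (x - min) / (max - min). A minimal sketch with a toy array (the values here are illustrative, not from the dataset):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy 2-column array, purely illustrative
toy = np.array([[1.0, 10.0],
                [2.0, 30.0],
                [4.0, 20.0]])

scaled = MinMaxScaler().fit_transform(toy)

# each column is mapped via (x - col_min) / (col_max - col_min)
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))
assert np.allclose(scaled, manual)
print(scaled)  # column 1: [0.0, 1/3, 1.0]; column 2: [0.0, 1.0, 0.5]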
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

lin_regressor = LinearRegression()
# pass the order of your polynomial here
poly = PolynomialFeatures(1)
# convert the features so they can be passed to the linear regression
X_transform = poly.fit_transform(x_train)  # x_train, y_train come from an earlier split (not shown)
# fit this to the Linear Regressor
linear_regg = lin_regressor.fit(X_transform, y_train)
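As an aside, one way to avoid juggling the transformed array separately (the source of the indexing trouble below) is to chain both steps with scikit-learn's make_pipeline, so the polynomial expansion happens inside fit. A minimal sketch, reusing the X_normalized and y_for_normalized defined above; this is not the original code:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# PolynomialFeatures runs inside the pipeline, so you only ever
# index the original DataFrame, never the transformed ndarray
poly_model = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())
poly_model.fit(X_normalized, y_for_normalized)
print(poly_model.score(X_normalized, y_for_normalized))  # R^2 on the training data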
import numpy as np
from sklearn.metrics import SCORERS
from sklearn.model_selection import KFold

scorer = SCORERS['r2']
cv = KFold(n_splits=5, random_state=0, shuffle=True)
train_scores, test_scores = [], []
for train, test in cv.split(X_normalized):
    X_transform2 = poly.fit_transform(X_normalized)
    OL = lin_regressor.fit(X_transform2.iloc[train], y_for_normalized.iloc[train])
    tr_21 = OL.score(X_train, y_train)
    ts_21 = OL.score(X_test, y_test)
    print("Train score:", tr_21)  # from documentation .score returns r^2
    print("Test score:", ts_21)  # from documentation .score returns r^2
    train_scores.append(tr_21)
    test_scores.append(ts_21)
print("The Mean for Train scores is:", np.mean(train_scores))
print("The Mean for Test scores is:", np.mean(test_scores))
Error message:
--------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/mm/r4gnnwl948zclfyx12w803040000gn/T/ipykernel_73165/2276765730.py in <module>
10 for train, test in cv.split(X_normalized):
11 X_transform2 = poly.fit_transform(X_normalized)
---> 12 OL=lin_regressor.fit(X_transform2.iloc[train], y_for_normalized.iloc[train])
13 tr_21 = OL.score(X_train, y_train)
14 ts_21 = OL.score(X_test, y_test)
AttributeError: 'numpy.ndarray' object has no attribute 'iloc'
Decision Trees
from sklearn.tree import DecisionTreeRegressor

new_model = DecisionTreeRegressor(max_depth=9, min_samples_split=10, random_state=0)

import numpy as np
from sklearn.metrics import SCORERS
from sklearn.model_selection import KFold

scorer = SCORERS['r2']
cv = KFold(n_splits=5, random_state=0, shuffle=True)
train_scores, test_scores = [], []
for train, test in cv.split(X_normalized):
    OO = new_model.fit(X_normalized.iloc[train], y_for_normalized.iloc[train])
    tr_2 = OO.score(X_train, y_train)
    ts_2 = OO.score(X_test, y_test)
    print("Train score:", tr_2)  # from documentation .score returns r^2
    print("Test score:", ts_2)  # from documentation .score returns r^2
    train_scores.append(tr_2)
    test_scores.append(ts_2)
print("The Mean for Train scores is:", np.mean(train_scores))
print("The Mean for Test scores is:", np.mean(test_scores))
Output
Train score: 0.8960560474997927
Test score: -0.15521696464773224
Train score: 0.8852795454592853
Test score: 0.17650772852710495
Train score: 0.5825347735306872
Test score: 0.34789159049344665
Train score: 0.8549575808716975
Test score: 0.7615265842042157
Train score: 0.8340261480334055
Test score: 0.14011826401728472
The Mean for Train scores is: 0.8105708190789735
The Mean for Test scores is: 0.2541654405188639
# Trial 1
import numpy as np
from sklearn.metrics import SCORERS
from sklearn.model_selection import KFold

scorer = SCORERS['r2']
cv = KFold(n_splits=5, random_state=0, shuffle=True)
train_scores, test_scores = [], []
for train, test in cv.split(X_normalized):
    X_transform2 = poly.fit_transform(X_normalized)
    OL = lin_regressor.fit(X_transform2[train], y_for_normalized[train])
    tr_21 = OL.score(X_train, y_train)
    ts_21 = OL.score(X_test, y_test)
    print("Train score:", tr_21)  # from documentation .score returns r^2
    print("Test score:", ts_21)  # from documentation .score returns r^2
    train_scores.append(tr_21)
    test_scores.append(ts_21)
print("The Mean for Train scores is:", np.mean(train_scores))
print("The Mean for Test scores is:", np.mean(test_scores))
Error message:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/var/folders/mm/r4gnnwl948zclfyx12w803040000gn/T/ipykernel_90924/12176184.py in <module>
10 for train, test in cv.split(X_normalized):
11 X_transform2 = poly.fit_transform(X_normalized)
---> 12 OL=lin_regressor.fit(X_transform2[train], y_for_normalized[train])
13 tr_21 = OL.score(X_train, y_train)
14 ts_21 = OL.score(X_test, y_test)
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
3462 if is_iterator(key):
3463 key = list(key)
-> 3464 indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
3465
3466 # take() does not accept boolean indexers
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1372 if use_interval_msg:
1373 key = list(key)
-> 1374 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
KeyError: "None of [Int64Index([ 0, 1, 3, 4, 5, 6, 9, 10, 11, 12, 14, 15, 17, 18, 19, 20, 21,\n 23, 25, 27, 28, 29, 31, 32, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,\n 44, 45, 46, 47, 48, 49, 50, 51, 52, 56, 57, 58, 59, 60, 61, 62, 63,\n 64, 65, 66, 67, 68, 69, 70, 71, 72, 74, 76, 77, 79, 80, 81, 82, 83,\n 84, 85, 87, 88, 89, 90, 91, 94, 96, 97, 98, 99],\n dtype='int64')] are in the [columns]"
Answer:
Understanding

- `poly.fit_transform` returns a `numpy.ndarray`, so here your `X_normalized` is being transformed from a `pandas.core.frame.DataFrame` into a `numpy.ndarray`.
- But your `y_for_normalized` is still a `pandas.core.frame.DataFrame`.
- So for a `numpy.ndarray` you pass indexes as `array[indexes]`, while for a `pandas.core.frame.DataFrame` you pass indexes via `.iloc[indexes]` (see the short demo below).
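To see the difference concretely, here is a tiny demo with toy data (not the question's dataset) of positional row indexing on a `numpy.ndarray` versus a `pandas.DataFrame`:

import numpy as np
import pandas as pd

arr = np.arange(8).reshape(4, 2)            # a plain ndarray
df = pd.DataFrame(arr, columns=["a", "b"])  # the same data as a DataFrame
idx = np.array([0, 2])                      # row positions, like KFold's train/test indexes

print(arr[idx])      # ndarray: [] selects rows by position -> rows 0 and 2
print(df.iloc[idx])  # DataFrame: .iloc[] selects rows by position -> rows 0 and 2
# arr.iloc[idx] would raise AttributeError ('numpy.ndarray' object has no attribute 'iloc'),
# and df[idx] would raise KeyError, because [] on a DataFrame selects columns by label:
# exactly the two errors shown in the question.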
Solution

- For `X_transform2`, use `[]` to get the data, as it's a `numpy.ndarray`.
- For `y_for_normalized`, use `.iloc[]`, as it's a `pandas.core.frame.DataFrame`.
Code
train_scores, test_scores = [], []
for train, test in cv.split(X_normalized):
    X_transform2 = poly.fit_transform(X_normalized)
    # [] for X_transform2 (ndarray), .iloc[] for y_for_normalized (DataFrame)
    OL = lin_regressor.fit(X_transform2[train], y_for_normalized.iloc[train])
    tr_21 = OL.score(X_transform2[train], y_for_normalized.iloc[train])
    ts_21 = OL.score(X_transform2[test], y_for_normalized.iloc[test])
    print("Train score:", tr_21)  # from documentation .score returns r^2
    print("Test score:", ts_21)  # from documentation .score returns r^2
    train_scores.append(tr_21)
    test_scores.append(ts_21)
print("The Mean for Train scores is:", np.mean(train_scores))
print("The Mean for Test scores is:", np.mean(test_scores))
PS:

- I don't know why you are using `X_train`, `y_train` and `X_test`, `y_test` in `OL.score`. It should be the data at the `train` and `test` indexes generated by `cv`, which is what the code snippet above does.
  - If you have `X_train`, `y_train` and `X_test`, `y_test` defined for a specific reason, then you are good to use them.
- Why are you using `PolynomialFeatures()` when you want all your features to be of degree 1? They already are, so `PolynomialFeatures()` with degree 1 makes no difference.
- Also check for a deprecation warning for `SCORERS` if you are using a newer version of `sklearn`. (A loop-free alternative is sketched below.)
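For what it's worth, scikit-learn can run this whole loop for you. A minimal sketch using cross_validate with the same 5-fold setup; bundling the degree-1 PolynomialFeatures and the regressor into a pipeline is my assumption here, not the original code:

import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

cv = KFold(n_splits=5, random_state=0, shuffle=True)
model = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())

# scoring="r2" replaces the SCORERS['r2'] lookup; return_train_score=True
# also records the per-fold train scores, matching the manual loop above
results = cross_validate(model, X_normalized, y_for_normalized,
                         cv=cv, scoring="r2", return_train_score=True)
print("The Mean for Train scores is:", np.mean(results["train_score"]))
print("The Mean for Test scores is:", np.mean(results["test_score"]))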