I want to map the categorical features to numeric features.
When I use only the continuous features for prediction, the decision tree works well.
However, after I replace them with the mapped categorical features, I get an error.
The output of df.info() is as follows:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 114641 entries, 0 to 145458
Data columns (total 19 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Date           114641 non-null  object
 1   Location       114641 non-null  object
 2   MinTemp        114641 non-null  float64
 3   MaxTemp        114641 non-null  float64
 4   Rainfall       114641 non-null  float64
 5   WindGustDir    114641 non-null  object
 6   WindGustSpeed  114641 non-null  float64
 7   WindDir9am     114641 non-null  object
 8   WindDir3pm     114641 non-null  object
 9   WindSpeed9am   114641 non-null  float64
 10  WindSpeed3pm   114641 non-null  float64
 11  Humidity9am    114641 non-null  float64
 12  Humidity3pm    114641 non-null  float64
 13  Pressure9am    114641 non-null  float64
 14  Pressure3pm    114641 non-null  float64
 15  Temp9am        114641 non-null  float64
 16  Temp3pm        114641 non-null  float64
 17  RainToday      114641 non-null  object
 18  RainTomorrow   114641 non-null  object
dtypes: float64(12), object(7)
memory usage: 17.5 MB
None
My code:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# input: dataframe, a categorical feature name
# output: mapping dictionary of the categorical feature
def get_mapping_function(dataframe, featureName):
    items = dataframe[featureName].value_counts().index
    index = 0
    item_dic = {}
    for item in items:
        if item not in item_dic.keys():
            item_dic[item] = index
            index += 1  # give each distinct category its own integer code
    return item_dic
# input: a dataframe
# output: a list of the categorical features' names (except the target feature)
def get_categorical_features(dataframe):
    categorical_features = dataframe.select_dtypes(include=['object'])
    return categorical_features.columns.tolist()[:-1]

# input: a dataframe
# output: the dataframe with its categorical features mapped to integers
def map_categorical_features(dataframe):
    categorical_features_list = get_categorical_features(dataframe)
    for item in categorical_features_list:
        item_mapping_function = get_mapping_function(dataframe, item)
        dataframe[item] = dataframe[item].map(item_mapping_function)
    return dataframe

mapped_df = map_categorical_features(df)
categorical_features = get_categorical_features(df)
X = mapped_df[categorical_features]
y = mapped_df['RainTomorrow']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
entropy_result = []
gini_result = []
depth = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

for i in depth:
    entropy_tree = DecisionTreeClassifier(criterion='entropy', max_depth=i, random_state=42)
    gini_tree = DecisionTreeClassifier(criterion='gini', max_depth=i, random_state=42)
    entropy_tree.fit(X_train, y_train)
    gini_tree.fit(X_train, y_train)
The full error traceback:
Traceback (most recent call last):
  File "D:/University_Files/Stage4_1/COMP3010J Machine Learning/Final Project/mapping_test.py", line 93, in <module>
    entropy_tree.fit(X_train, y_train)
  File "C:\A_Rone\App_installation\Anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 898, in fit
    super().fit(
  File "C:\A_Rone\App_installation\Anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 156, in fit
    X, y = self._validate_data(X, y,
  File "C:\A_Rone\App_installation\Anaconda3\lib\site-packages\sklearn\base.py", line 430, in _validate_data
    X = check_array(X, **check_X_params)
  File "C:\A_Rone\App_installation\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\A_Rone\App_installation\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 540, in check_array
    dtype_orig = np.result_type(*dtypes_orig)
  File "<__array_function__ internals>", line 5, in result_type
ValueError: at least one array or dtype is required
I suspect the assignment of y is causing the error, but I do not know how to fix it. Printing the split data gives the following output:
print(X_train, X_test, y_train, y_test)
Empty DataFrame
Columns: []
Index: [97732, 47179, 90514, 37965, 61357, 108402, 136796, 67067, 31363, 31279, 49001, 18616, 122381, 33763, 65280, 83481, 128230, 49320, 51874, 66537, 55868, 67078, 22454, 134485, 35378, 12044, 73579, 50406, 138150, 127639, 132684, 87756, 87190, 142457, 136687, 84706, 86256, 122486, 101164, 100699, 33662, 19845, 37830, 64909, 46547, 114277, 70964, 34087, 103691, 118351, 6693, 136364, 40065, 100480, 73413, 106728, 2421, 4937, 103863, 63212, 93006, 73444, 88837, 80963, 95315, 86467, 88366, 50543, 107716, 38372, 105056, 94533, 91286, 72277, 107540, 64876, 142346, 61742, 31468, 80789, 8366, 4823, 104993, 71519, 48920, 91247, 77687, 98786, 64832, 56107, 40064, 127695, 95310, 55509, 60964, 133396, 22413, 108102, 35808, 9354, ...]

[68784 rows x 0 columns]
Empty DataFrame
Columns: []
Index: [131088, 104247, 9064, 44737, 79256, 121040, 60069, 48557, 71471, 939, 10841, 58492, 75776, 105224, 111890, 33455, 95278, 36808, 132926, 21194, 145243, 117451, 104100, 12575, 77387, 139177, 2093, 83336, 9592, 12922, 144592, 72549, 74899, 29264, 106665, 104979, 69418, 78832, 113960, 64850, 28800, 61596, 68775, 56243, 58247, 39398, 131100, 30140, 115252, 7773, 119882, 89388, 81229, 10129, 95636, 12117, 81936, 59539, 136470, 88898, 65355, 29350, 56864, 28719, 19608, 41586, 87230, 83818, 117414, 5573, 29660, 14733, 65081, 86923, 120223, 59517, 76030, 92979, 48380, 86802, 107474, 18555, 93250, 4907, 96425, 88040, 22975, 114776, 7989, 123267, 117246, 112733, 123029, 121637, 142782, 57718, 61341, 139014, 6690, 100863, ...]

[45857 rows x 0 columns]
97732     Yes
47179      No
90514      No
37965      No
61357      No
         ...
60621      No
77412      No
43418      No
63855     Yes
127070    Yes
Name: RainTomorrow, Length: 68784, dtype: object
131088    No
104247    No
9064      No
44737     No
79256     No
          ..
43533     No
84797     No
130927    No
22128     No
76088     No
Name: RainTomorrow, Length: 45857, dtype: object
CodePudding user response:
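The prints added at the end of the question show that your X_train and X_test variables are empty dataframes (0 columns), which is why fit() raises "at least one array or dtype is required". The cause is that map_categorical_features(df) modifies df in place: once it has run, the only object column left in df is RainTomorrow, so the later call to get_categorical_features(df) drops that last column with [:-1] and returns an empty list. X = mapped_df[categorical_features] therefore selects no columns. Build the list of categorical features before mapping, or map a copy of the dataframe.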
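
One way to avoid the empty X is sketched below. It is only an illustration, not the one correct fix: the tiny dataframe is made up to stand in for the real weather data, the target parameter and the feature_names argument are additions that are not in the original code, and the helper bodies are rewritten rather than copied from the question. The idea is to compute the list of categorical columns once, before anything is mapped, and to map a copy so the original df keeps its object columns.

```python
import pandas as pd


def get_categorical_features(dataframe, target="RainTomorrow"):
    """Names of the object-dtype columns, excluding the target column."""
    object_columns = dataframe.select_dtypes(include=["object"]).columns
    return [name for name in object_columns if name != target]


def map_categorical_features(dataframe, feature_names):
    """Return a copy in which each listed column is replaced by integer codes."""
    mapped = dataframe.copy()  # the caller's dataframe is left untouched
    for name in feature_names:
        categories = mapped[name].value_counts().index
        mapping = {category: code for code, category in enumerate(categories)}
        mapped[name] = mapped[name].map(mapping)
    return mapped


# Made-up sample data (an assumption); the real df has 19 columns.
df = pd.DataFrame({
    "Location": ["Albury", "Sydney", "Albury", "Perth"],
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
    "RainToday": ["No", "Yes", "No", "No"],
    "RainTomorrow": ["No", "Yes", "No", "Yes"],
})
# Force object dtype so the columns match the df.info() output in the question.
df = df.astype({"Location": object, "RainToday": object, "RainTomorrow": object})

categorical_features = get_categorical_features(df)  # computed BEFORE mapping
mapped_df = map_categorical_features(df, categorical_features)

X = mapped_df[categorical_features]
y = mapped_df["RainTomorrow"]
print(X.shape)  # the number of columns is no longer 0
```

With X no longer empty, train_test_split and DecisionTreeClassifier.fit should receive real feature columns. Excluding the target by name instead of with [:-1] also means the result does not depend on RainTomorrow being the last object column.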