pandas error: ValueError: at least one array or dtype is required

I just want to map the categorical features to numeric features.

When I use only the continuous features for prediction, the decision tree works well.

However, after I replace them with the mapped categorical features, I get an error.

The output of df.info() is as follows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 114641 entries, 0 to 145458
Data columns (total 19 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           114641 non-null  object 
 1   Location       114641 non-null  object 
 2   MinTemp        114641 non-null  float64
 3   MaxTemp        114641 non-null  float64
 4   Rainfall       114641 non-null  float64
 5   WindGustDir    114641 non-null  object 
 6   WindGustSpeed  114641 non-null  float64
 7   WindDir9am     114641 non-null  object 
 8   WindDir3pm     114641 non-null  object 
 9   WindSpeed9am   114641 non-null  float64
 10  WindSpeed3pm   114641 non-null  float64
 11  Humidity9am    114641 non-null  float64
 12  Humidity3pm    114641 non-null  float64
 13  Pressure9am    114641 non-null  float64
 14  Pressure3pm    114641 non-null  float64
 15  Temp9am        114641 non-null  float64
 16  Temp3pm        114641 non-null  float64
 17  RainToday      114641 non-null  object 
 18  RainTomorrow   114641 non-null  object 
dtypes: float64(12), object(7)
memory usage: 17.5  MB
None

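My code:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# input: a dataframe and a categorical feature name
# output: mapping dictionary of the categorical feature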
def get_mapping_function(dataframe, featureName):
    items = dataframe[featureName].value_counts().index
    index = 0
    item_dic = {}
    for item in items:
        if item not in item_dic.keys():
            item_dic[item] = index
            index += 1
    return item_dic

# input: a dataframe
# output: list of categorical feature names (excluding the target feature)
def get_categorical_features(dataframe):
    categorical_features = dataframe.select_dtypes(include=['object'])
    return categorical_features.columns.tolist()[:-1]

# input: a dataframe
# output: the same dataframe with its categorical features mapped (modified in place)
def map_categorical_features(dataframe):
    categorical_features_list = get_categorical_features(dataframe)
    for item in categorical_features_list:
        item_mapping_function = get_mapping_function(dataframe, item)
        dataframe[item] = dataframe[item].map(item_mapping_function)
    return dataframe

mapped_df = map_categorical_features(df)

categorical_features = get_categorical_features(df)
X = mapped_df[categorical_features]
y = mapped_df['RainTomorrow']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

entropy_result = []
gini_result = []
depth = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
#
for i in depth:
    entropy_tree = DecisionTreeClassifier(criterion='entropy', max_depth=i, random_state=42)
    gini_tree = DecisionTreeClassifier(criterion='gini', max_depth=i, random_state=42)
    entropy_tree.fit(X_train, y_train)
    gini_tree.fit(X_train, y_train)

The full error traceback:

Traceback (most recent call last):
  File "D:/University_Files/Stage4_1/COMP3010J Machine Learning/Final Project/mapping_test.py", line 93, in <module>
    entropy_tree.fit(X_train, y_train)
  File "C:\A_Rone\App_installation\Anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 898, in fit
    super().fit(
  File "C:\A_Rone\App_installation\Anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 156, in fit
    X, y = self._validate_data(X, y,
  File "C:\A_Rone\App_installation\Anaconda3\lib\site-packages\sklearn\base.py", line 430, in _validate_data
    X = check_array(X, **check_X_params)
  File "C:\A_Rone\App_installation\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\A_Rone\App_installation\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 540, in check_array
    dtype_orig = np.result_type(*dtypes_orig)
  File "<__array_function__ internals>", line 5, in result_type
ValueError: at least one array or dtype is required

I suspect the assignment of y is causing the error, but I do not know how to fix it.

print(X_train, X_test, y_train, y_test) gives:

Empty DataFrame
Columns: []
Index: [97732, 47179, 90514, 37965, 61357, 108402, 136796, 67067, 31363, 31279, 49001, 18616, 122381, 33763, 65280, 83481, 128230, 49320, 51874, 66537, 55868, 67078, 22454, 134485, 35378, 12044, 73579, 50406, 138150, 127639, 132684, 87756, 87190, 142457, 136687, 84706, 86256, 122486, 101164, 100699, 33662, 19845, 37830, 64909, 46547, 114277, 70964, 34087, 103691, 118351, 6693, 136364, 40065, 100480, 73413, 106728, 2421, 4937, 103863, 63212, 93006, 73444, 88837, 80963, 95315, 86467, 88366, 50543, 107716, 38372, 105056, 94533, 91286, 72277, 107540, 64876, 142346, 61742, 31468, 80789, 8366, 4823, 104993, 71519, 48920, 91247, 77687, 98786, 64832, 56107, 40064, 127695, 95310, 55509, 60964, 133396, 22413, 108102, 35808, 9354, ...]

[68784 rows x 0 columns] Empty DataFrame
Columns: []
Index: [131088, 104247, 9064, 44737, 79256, 121040, 60069, 48557, 71471, 939, 10841, 58492, 75776, 105224, 111890, 33455, 95278, 36808, 132926, 21194, 145243, 117451, 104100, 12575, 77387, 139177, 2093, 83336, 9592, 12922, 144592, 72549, 74899, 29264, 106665, 104979, 69418, 78832, 113960, 64850, 28800, 61596, 68775, 56243, 58247, 39398, 131100, 30140, 115252, 7773, 119882, 89388, 81229, 10129, 95636, 12117, 81936, 59539, 136470, 88898, 65355, 29350, 56864, 28719, 19608, 41586, 87230, 83818, 117414, 5573, 29660, 14733, 65081, 86923, 120223, 59517, 76030, 92979, 48380, 86802, 107474, 18555, 93250, 4907, 96425, 88040, 22975, 114776, 7989, 123267, 117246, 112733, 123029, 121637, 142782, 57718, 61341, 139014, 6690, 100863, ...]

[45857 rows x 0 columns] 97732     Yes
47179      No
90514      No
37965      No
61357      No
         ...
60621      No
77412      No
43418      No
63855     Yes
127070    Yes
Name: RainTomorrow, Length: 68784, dtype: object 131088    No
104247    No
9064      No
44737     No
79256     No
          ..
43533     No
84797     No
130927    No
22128     No
76088     No
Name: RainTomorrow, Length: 45857, dtype: object

CodePudding user response：

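Judging from the print output added at the end of the question, the error occurs because X_train and X_test are empty DataFrames (0 columns). The likely cause is that map_categorical_features modifies df in place, so mapped_df and df are the same object. By the time get_categorical_features(df) is called the second time, RainTomorrow is the only object column left, and the [:-1] slice drops it. As a result categorical_features is an empty list and mapped_df[categorical_features] selects no columns. The assignment of y is fine.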
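
One possible way to avoid the empty X is to collect the categorical column names once, before any mapping happens, and to map a copy rather than df itself. The following is only a minimal sketch on a toy DataFrame: the sample values and the `target` and `features` parameters are assumptions made for illustration and are not part of the original code.

```python
import pandas as pd

# Toy stand-in for the weather data (hypothetical values, not the real dataset).
# dtype=object mirrors the object columns shown in df.info().
df = pd.DataFrame(
    {
        "Location": ["Albury", "Sydney", "Albury", "Perth"],
        "WindGustDir": ["W", "NW", "W", "E"],
        "RainToday": ["No", "Yes", "No", "No"],
        "RainTomorrow": ["No", "Yes", "No", "Yes"],
    },
    dtype=object,
)
df["MinTemp"] = [13.4, 7.4, 12.9, 9.2]  # a continuous (float64) column


def get_categorical_features(dataframe, target="RainTomorrow"):
    """Names of the object-dtype columns, excluding the target by name."""
    columns = dataframe.select_dtypes(include=["object"]).columns
    return [name for name in columns if name != target]


def map_categorical_features(dataframe, features):
    """Return a copy in which each listed column holds integer codes."""
    mapped = dataframe.copy()  # leave the caller's dataframe untouched
    for name in features:
        categories = mapped[name].value_counts().index
        mapping = {category: code for code, category in enumerate(categories)}
        mapped[name] = mapped[name].map(mapping)
    return mapped


# Record the feature names once, before any column changes dtype.
categorical_features = get_categorical_features(df)
mapped_df = map_categorical_features(df, categorical_features)

X = mapped_df[categorical_features]
y = mapped_df["RainTomorrow"]
print(X.shape)  # (4, 3)
```

Excluding the target by name instead of slicing with [:-1] means the function still returns the right columns if it is called again after the mapping. Checking X.shape before calling fit is a cheap way to catch a zero-column X early.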