Following the Vertica example at https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/AnalyzingData/MachineLearning/DataPreparation/EncodingCategoricalColumns.htm?tocpath=Analyzing Data|Machine Learning for Predictive Analytics|Data Preparation|_____3 , which uses the Titanic data from Kaggle, the ONE_HOT_ENCODER_FIT function converts categorical data and creates a model that stores the new representation of that data:
SELECT one_hot_encoder_fit('public.titanic_encoder','titanic_training','sex, embarkation_point' USING PARAMETERS exclude_columns='', output_view='', extra_levels='{}');
==================
varchar_categories
==================
category_name     | category_level | category_level_index
------------------+----------------+---------------------
embarkation_point | C              | 0
embarkation_point | Q              | 1
embarkation_point | S              | 2   <- note S is 2
embarkation_point |                | 3
sex               | female         | 0
sex               | male           | 1   <- note male is 1
When I then apply the titanic_encoder model to the titanic_training data as shown below, why is embarkation_point_2 added? Shouldn't the output contain only the categorical value (say S) and its encoded value? Why do I see the values 0 and 1 and not 2, which is the encoded value for S, in the same way that sex male gets sex_1 = 1?
dbadmin@2e4e746b3e6c(*)=> select * from titanic_training limit 1;
passenger_id | survived | pclass | name | sex | age | sibling_and_spouse_count | parent_and_child_count | ticket | fare | cabin | embarkation_point
-------------- ---------- -------- ------------------------- ------ ----- -------------------------- ------------------------ ----------- ------ ------- -------------------
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S <-- note S
(1 row)
dbadmin@2e4e746b3e6c(*)=> SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder') from titanic_training limit 1;
passenger_id | survived | pclass | name | sex | sex_1 | age | sibling_and_spouse_count | parent_and_child_count | ticket | fare | cabin | embarkation_point | embarkation_point_1 | embarkation_point_2 (<-- why this is here)?
-------------- ---------- -------- ------------------------- ------ ------- ----- -------------------------- ------------------------ ----------- ------ ------- ------------------- --------------------- ---------------------
1 | 0 | 3 | Braund, Mr. Owen Harris | male <- note male| 1 <- note encoded value of male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S <- note S | 0 <- why this is here | 1 <-- why this is here. Where is 2?
(1 row)
Why is there no embarkation_point_3?
CodePudding user response:
There are several reasons for the output you see. First, read the documentation for APPLY_ONE_HOT_ENCODER: https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/APPLY_ONE_HOT_ENCODER.htm?tocpath=SQL Reference Manual|SQL Functions|Machine Learning Functions|Transformation Functions|_____5
Two parameters let you get the output you want:
- drop_first: set it to false to get all the columns. By default the column for the first level is dropped because the encoded columns are correlated with one another (the dropped column can be inferred from the rest). There are pros and cons; see this article: https://inmachineswetrust.com/posts/drop-first-columns/
- column_naming: set it to values so that the new columns are named after the category values rather than their indices, but be careful: if you have categories with special characters, you might face some difficulties.
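As a rough sketch of how the two parameters could be combined (not run here; the parameter names drop_first and column_naming come from the APPLY_ONE_HOT_ENCODER documentation linked above, and the model and table names are the ones from the question):

```sql
-- Apply the encoder, keeping a column for every category level
-- (drop_first=false) and naming the new columns after the category
-- values instead of their indices (column_naming='values').
SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder',
                                                drop_first=false,
                                                column_naming='values')
FROM titanic_training
LIMIT 1;
```

One-hot encoding produces one 0/1 column per level rather than a single column holding the level index. That is why, in the default output in the question, a passenger who embarked at S shows embarkation_point_1 = 0 and embarkation_point_2 = 1: the column for index 0 (C) is dropped, the column for index 1 (Q) is 0, and the column for index 2 (S) is 1. The empty level with index 3 is presumably the NULL category; if I remember the documentation correctly, the ignore_null parameter controls whether it gets a column of its own, so check that page before relying on it.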
Badr