Function that calculates distance for data with both binary and numeric columns?

Is there a distance function that can calculate both binary and numeric column distances at once?

tibble( Observation = c(1:6), V1 = c(3, 5, 4, 6, 9, 5),
    V2 = c("a", "b", "a", "c", "b", "a"), 
    label = c("Red", "Red", "Blue", "Blue", "Red", "Blue")) %>% 
select(2:4) %>% 
fastDummies::dummy_cols() %>% 
select(c(-V2, -label))

I typically use dist(df, method = 'binary'), but now I have a numeric column alongside the new dummy columns I created. The numeric column, V1, is as important as the dummy variables.

CodePudding user response:

The distmix function from the kmed package handles this: you pass the column indices of the numeric, binary, and categorical variables in idnum, idbin, and idcat respectively. From ?distmix:

idnum - A vector of column index of the numerical variables.

idbin - A vector of column index of the binary variables.

idcat - A vector of column index of the categorical variables.

library(kmed)
distmix(df1, idnum = 1, idbin = 2:ncol(df1))

In the example data, the numeric column is the first column and all the other columns are binary, so we pass 2:ncol(df1) as idbin.
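To make the idea of a mixed distance concrete, here is a minimal base-R sketch of a Gower-style distance on the same data. The helper gower_sketch is hypothetical and is not part of kmed. It assumes range-scaled absolute differences for numeric columns and simple matching for binary columns, with every column weighted equally, so its values may not match what distmix returns exactly.

```r
# Same values as df1 after dummy coding: one numeric column, five 0/1 columns.
df1 <- data.frame(
  V1         = c(3, 5, 4, 6, 9, 5),
  V2_a       = c(1, 0, 1, 0, 0, 1),
  V2_b       = c(0, 1, 0, 0, 1, 0),
  V2_c       = c(0, 0, 0, 1, 0, 0),
  label_Blue = c(0, 0, 1, 1, 0, 1),
  label_Red  = c(1, 1, 0, 0, 1, 0)
)

# Hypothetical helper (an illustration, not the kmed implementation).
gower_sketch <- function(data, idnum, idbin) {
  x <- as.matrix(data)
  n <- nrow(x)
  d <- matrix(0, n, n)
  # Numeric columns: absolute difference scaled by the column range,
  # so each column contributes a value between 0 and 1.
  # Assumes the column is not constant (range > 0).
  for (j in idnum) {
    d <- d + abs(outer(x[, j], x[, j], "-")) / diff(range(x[, j]))
  }
  # Binary columns: 0 if the two rows agree, 1 if they differ.
  for (j in idbin) {
    d <- d + outer(x[, j], x[, j], "!=") * 1
  }
  # Average over all columns so every variable has equal weight.
  d / (length(idnum) + length(idbin))
}

d <- gower_sketch(df1, idnum = 1, idbin = 2:ncol(df1))
round(d, 3)
```

The result is a square symmetric matrix; as.dist(d) converts it to a dist object if a later step such as hclust expects one.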

data

library(dplyr)  # provides tibble(), select() and %>%
df1 <- tibble( Observation = c(1:6), V1 = c(3, 5, 4, 6, 9, 5),
    V2 = c("a", "b", "a", "c", "b", "a"), 
    label = c("Red", "Red", "Blue", "Blue", "Red", "Blue")) %>% 
select(2:4) %>% 
fastDummies::dummy_cols() %>% 
select(c(-V2, -label))