Caret dummy variable does not work as expected

I am trying to use caret's dummyVars function in R to convert some categorical data to numeric. My dataset has an id, town (A or B), district (d1, d2, d3), street (s1, s2, s3, s4), family (f1, f2, f3), gender (male, female), and replicate (numeric). Here is a snapshot: [image: dataset snapshot]

Here is the code I currently have to encode the variables:

library('caret')
train <- read.csv("HW1PB4Data_train.csv", header = TRUE)
dummy <- dummyVars("~ .", data = train)
train2 <- data.frame(predict(dummy, newdata = train))
train2

When I look at the output, train2, it has created a few additional towns (C, D, E) which did not exist in the original data. This does not happen with any of the other columns. Why is this? How do I fix it? Here is a snapshot of the output data: [image: output snapshot]

CodePudding user response：

We can use tidyr::pivot_wider or fastDummies::dummy_cols

Example data:

library(dplyr)        # tibble() and %>%
library(tidyr)        # pivot_wider()
library(fastDummies)  # dummy_cols()

df <- tibble(subject = c(1.2, 1.5), town = c('a', 'b'), street = c('1', '2'))

# A tibble: 2 × 3
  subject town  street
    <dbl> <chr> <chr> 
1     1.2 a     1     
2     1.5 b     2

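Solution with tidyr (note that each resulting column combines a town value with a street value rather than encoding one variable at a time):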

df %>% pivot_wider(names_from= c(town:street),
                   values_from = c(town:street),
                   values_fill = 0,
                   values_fn = ~1)

# A tibble: 2 × 5
  subject town_a_1 town_b_2 street_a_1 street_b_2
    <dbl>    <dbl>    <dbl>      <dbl>      <dbl>
1     1.2        1        0          1          0
2     1.5        0        1          0          1

Solution with dummy_cols:

dummy_cols(df,
           c("town", "street"),
           remove_selected_columns = TRUE)

# A tibble: 2 × 5
  subject town_a town_b street_1 street_2
    <dbl>  <int>  <int>    <int>    <int>
1     1.2      1      0        1        0
2     1.5      0      1        0        1
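The question does not show enough to say for certain why columns for towns C, D and E appear. One possible cause, offered here only as an assumption, is that town is stored as a factor that declares more levels than actually occur in the rows. The sketch below uses a small made-up data frame (not the asker's file) and base R's model.matrix rather than caret, to show how unused levels produce extra indicator columns and how droplevels() removes them.

```r
# Hypothetical data: 'town' declares levels C, D, E
# even though only A and B occur in the rows.
train <- data.frame(
  town = factor(c("A", "B", "A"), levels = c("A", "B", "C", "D", "E")),
  replicate = c(1, 2, 1)
)

# Count how often each declared level actually occurs;
# levels with a count of 0 are unused.
table(train$town, useNA = "ifany")

# Unused levels still receive an (all-zero) indicator column.
with_unused <- model.matrix(~ town - 1, data = train)
colnames(with_unused)

# droplevels() discards levels that never occur, so only A and B remain.
train_clean <- droplevels(train)
without_unused <- model.matrix(~ town - 1, data = train_clean)
colnames(without_unused)
```

Running table() on the real town column is a cheap first check: if C, D or E show up there, whether with zero or non-zero counts, the extra columns come from the data itself and not from the encoding function.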

CodePudding user response：

The above answer is already good. For a variable with only two categories, you can also take the simpler route and use an ifelse statement to convert it from categorical to numeric. An example dataset similar to yours:

train <- data.frame(subject = round(rnorm(n=100,
                                    mean=5,
                                    sd=2)), # rounded subjects
                    town = rep(c("A","B"),50),
                    district = rep(c("d1","d2"),50),
                    street = rep(c("s1","s2"),50),
                    family = rep(c("f1","f2"),50),
                    gender = rep(c("male","female"),50),
                    replicate = rbinom(n=100,
                                       size=2,
                                       prob=.9))
head(train)

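The first six rows look like this (the numeric values will differ between runs because the data is generated randomly without a seed):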

  subject town district street family gender replicate
1       6    A       d1     s1     f1   male         2
2       4    B       d2     s2     f2 female         2
3       4    A       d1     s1     f1   male         1
4       7    B       d2     s2     f2 female         2
5       3    A       d1     s1     f1   male         2
6       6    B       d2     s2     f2 female         2

Simply mutate the gender data with ifelse by coding "male" as 0 and everything else ("female" in this case) as 1:

library(dplyr)  # provides %>% and mutate()

m.train <- train %>% 
  mutate(gender = ifelse(gender == "male", 0, 1))
head(m.train)

You get a transformed gender variable with 0's and 1's for dummy coding:

  subject town district street family gender replicate
1       6    A       d1     s1     f1      0         2
2       4    B       d2     s2     f2      1         2
3       4    A       d1     s1     f1      0         1
4       7    B       d2     s2     f2      1         2
5       3    A       d1     s1     f1      0         2
6       6    B       d2     s2     f2      1         2
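The ifelse approach fits two-level variables such as gender. For a variable with more categories, such as district (d1, d2, d3 in the question), one 0/1 column per level is needed. A minimal base R sketch, using a made-up data frame and model.matrix as one possible way to do this:

```r
# Hypothetical three-level column, mirroring 'district' from the question.
train <- data.frame(district = c("d1", "d2", "d3", "d1"))

# '- 1' drops the intercept so that every level gets its own 0/1 column.
district_dummies <- model.matrix(~ district - 1, data = train)

# Column names are the variable name followed by the level.
colnames(district_dummies)

# Each row has exactly one 1, marking the district it belongs to.
rowSums(district_dummies)
```

The indicator columns can then be joined back to the other columns with cbind() if a single data frame is needed.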