I am trying to use caret's dummyVars function in R to convert some categorical data to numeric. My dataset has an id, town (A or B), district (d1, d2, d3), street (s1, s2, s3, s4), family (f1, f2, f3), gender (male, female), and replicate (numeric). Here is a snapshot: Dataset Snapshot
Here is the code I currently have to encode the variables:
library('caret')
train <- read.csv("HW1PB4Data_train.csv", header = TRUE)
dummy <- dummyVars("~ .", data = train)
train2 <- data.frame(predict(dummy, newdata = train))
train2
When I look at the output, train2, it contains a few additional towns (C, D, E) that do not exist in the original data. This does not happen with any of the other columns. Why is this, and how do I fix it? Here is a snapshot of the output data: Output
CodePudding user response:
We can use tidyr::pivot_wider or fastDummies::dummy_cols.
Example data:
library(dplyr)
df <- tibble(subject = c(1.2, 1.5), town = c('a', 'b'), street = c('1', '2'))
# A tibble: 2 × 3
  subject town  street
    <dbl> <chr> <chr>
1     1.2 a     1
2     1.5 b     2
Solution with tidyr:
library(tidyr) # provides pivot_wider
df %>% pivot_wider(names_from = c(town:street),
                   values_from = c(town:street),
                   values_fill = 0,
                   values_fn = ~1)
# A tibble: 2 × 5
  subject town_a_1 town_b_2 street_a_1 street_b_2
    <dbl>    <dbl>    <dbl>      <dbl>      <dbl>
1     1.2        1        0          1          0
2     1.5        0        1          0          1
Solution with dummy_cols:
library(fastDummies) # provides dummy_cols
dummy_cols(df,
           c("town", "street"),
           remove_selected_columns = TRUE)
# A tibble: 2 × 5
  subject town_a town_b street_1 street_2
    <dbl>  <int>  <int>    <int>    <int>
1     1.2      1      0        1        0
2     1.5      0      1        0        1
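Neither function above explains where the extra town columns came from. One possibility, which the question does not confirm, is that town is stored as a factor carrying levels that never occur in the rows, or that the CSV contains stray values. A minimal base R sketch for checking this, using a hypothetical toy data frame (toy_train and its unused levels C, D, E are made up for illustration):

```r
# Hypothetical stand-in for the real data: town is a factor whose
# declared levels include values that no row actually uses.
toy_train <- data.frame(
  id   = 1:4,
  town = factor(c("A", "B", "A", "B"), levels = c("A", "B", "C", "D", "E"))
)

# Count rows per level; unused levels show up with a count of 0.
town_counts <- table(toy_train$town, useNA = "ifany")
print(town_counts)

# Drop factor levels that do not occur in any row.
toy_clean <- droplevels(toy_train)
print(levels(toy_clean$town))
```

If the counts for the extra towns turn out to be nonzero on the real data, the values are actually present in the file and would need cleaning there instead.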
CodePudding user response:
The above answer is already good. You can also take the simple route and use an ifelse statement to convert your data from categorical to numeric. Here is an example dataset similar to yours:
train <- data.frame(subject = round(rnorm(n = 100,
                                          mean = 5,
                                          sd = 2)), # rounded subjects
                    town = rep(c("A", "B"), 50),
                    district = rep(c("d1", "d2"), 50),
                    street = rep(c("s1", "s2"), 50),
                    family = rep(c("f1", "f2"), 50),
                    gender = rep(c("male", "female"), 50),
                    replicate = rbinom(n = 100,
                                       size = 2,
                                       prob = .9))
head(train)
Seen below:
subject town district street family gender replicate
1 6 A d1 s1 f1 male 2
2 4 B d2 s2 f2 female 2
3 4 A d1 s1 f1 male 1
4 7 B d2 s2 f2 female 2
5 3 A d1 s1 f1 male 2
6 6 B d2 s2 f2 female 2
Simply mutate the gender column with ifelse, coding "male" as 0 and everything else ("female" in this case) as 1:
library(dplyr) # provides %>% and mutate
m.train <- train %>%
  mutate(gender = ifelse(gender == "male", 0, 1))
head(m.train)
You get a transformed gender variable with 0s and 1s for dummy coding:
subject town district street family gender replicate
1 6 A d1 s1 f1 0 2
2 4 B d2 s2 f2 1 2
3 4 A d1 s1 f1 0 1
4 7 B d2 s2 f2 1 2
5 3 A d1 s1 f1 0 2
6 6 B d2 s2 f2 1 2
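A single ifelse only covers two-level columns like gender. For columns with three or more levels, such as district in the question, one option is base R's model.matrix. This is a minimal sketch on a hypothetical toy data frame (toy is made up for illustration), not a drop-in replacement for the code above:

```r
# Hypothetical toy data with a three-level column.
toy <- data.frame(
  district = c("d1", "d2", "d3", "d1"),
  gender   = c("male", "female", "female", "male")
)

# "- 1" removes the intercept, so district gets one 0/1 column per level.
dummies <- model.matrix(~ district - 1, data = toy)

# Two-level column handled with ifelse, as in the answer above.
gender01 <- ifelse(toy$gender == "male", 0, 1)

# Combine everything into one numeric data frame.
encoded <- data.frame(dummies, gender = gender01)
print(encoded)
```

model.matrix names each column by pasting the variable name and the level together (districtd1, districtd2, districtd3), so you may want to rename them afterwards.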