I'm new to R and cannot get my head around why a very basic script does not perform one-hot encoding in a Windows environment, while it works perfectly well in a Linux environment. As I have to work in the failing Windows environment, I'd like to make the script perform one-hot encoding there.
This happens on Windows (one-hot encoding fails):
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.14.2 mltools_0.3.5
loaded via a namespace (and not attached):
[1] compiler_4.1.1 Matrix_1.3-4 tools_4.1.1 grid_4.1.1 lattice_0.20-44
>
> customers <- data.frame(
id=c(10, 20, 30, 40, 50),
gender=c('male', 'female', 'female', 'male', 'female'),
mood=c('happy', 'sad', 'happy', 'sad','happy'),
outcome=c(1, 1, 0, 0, 0))
>
> customers
  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0
>
> library(data.table)
> library(mltools)
>
> customers_1h <- one_hot(as.data.table(customers))
>
> customers_1h
   id gender  mood outcome
1: 10   male happy       1
2: 20 female   sad       1
3: 30 female happy       0
4: 40   male   sad       0
5: 50 female happy       0
This is what I'd expect to happen, as it does on Linux (one-hot encoding works):
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Leap 15.3
Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8
[4] LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0
>
> customers <- data.frame(
id=c(10, 20, 30, 40, 50),
gender=c('male', 'female', 'female', 'male', 'female'),
mood=c('happy', 'sad', 'happy', 'sad','happy'),
outcome=c(1, 1, 0, 0, 0))
>
> customers
  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0
>
> library(data.table)
data.table 1.14.2 using 8 threads (see ?getDTthreads). Latest news: r-datatable.com
> library(mltools)
>
> customers_1h <- one_hot(as.data.table(customers))
>
> customers_1h
   id gender_female gender_male mood_happy mood_sad outcome
1: 10             0           1          1        0       1
2: 20             1           0          0        1       1
3: 30             1           0          1        0       0
4: 40             0           1          0        1       0
5: 50             1           0          1        0       0
At least the same packages seem to be installed. So why does one-hot encoding silently not take place, without even an error? Can anyone give me a hint on how to get Windows to behave?
Many thanks in advance
Chris
CodePudding user response:
I think this has to do with your R versions, not the platform: your Windows machine runs R 4.1.1, while the Linux one runs R 3.5.0. One of the key defaults for creating data frames, stringsAsFactors, changed to FALSE in R 4.0 after years of tripping up unsuspecting new users. However, some packages, apparently including mltools, expect the kind of data frame that the old default, stringsAsFactors = TRUE, would create. By default, one_hot only encodes unordered factor columns, so character columns are returned unchanged and no error is raised. For more: https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html
I was able to replicate the problem and fix it by setting stringsAsFactors = TRUE. (BTW, it looks like mltools::one_hot expects a data.table as input, so I'm not sure there's a way to avoid using that package.)
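A quick way to check which case you are in is to look at the column classes. The following is a minimal base R sketch; the names customers_chr and customers_fct are made up for illustration, and stringsAsFactors is set explicitly so the result does not depend on the R version:

```r
# Same data, built once with each setting of stringsAsFactors.
customers_chr <- data.frame(
  gender = c("male", "female", "female"),
  stringsAsFactors = FALSE  # the default since R 4.0
)
customers_fct <- data.frame(
  gender = c("male", "female", "female"),
  stringsAsFactors = TRUE   # the default before R 4.0
)

class(customers_chr$gender)  # "character"
class(customers_fct$gender)  # "factor"
```

If sapply(customers, class) reports "character" for gender and mood on your Windows machine, that would explain why the columns are passed through untouched.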
Doesn't work:
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c('male', 'female', 'female', 'male', 'female'),
  mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0))
mltools::one_hot(data.table::as.data.table(customers))
   id gender  mood outcome
1: 10   male happy       1
2: 20 female   sad       1
3: 30 female happy       0
4: 40   male   sad       0
5: 50 female happy       0
Works:
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c('male', 'female', 'female', 'male', 'female'),
  mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0), stringsAsFactors = TRUE)
mltools::one_hot(data.table::as.data.table(customers))
   id gender_female gender_male mood_happy mood_sad outcome
1: 10             0           1          1        0       1
2: 20             1           0          0        1       1
3: 30             1           0          1        0       0
4: 40             0           1          0        1       0
5: 50             1           0          1        0       0