One hot fail - windows does not do one hot encoding

Time:10-20

I'm new to R and cannot get my head around why a very basic script does not perform one-hot encoding in a Windows environment while it works perfectly well in a Linux environment. As I have to work in the failing Windows environment, I'd like to make the script perform one-hot encoding there.

This happens on Windows (one-hot fail):

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.14.2 mltools_0.3.5

loaded via a namespace (and not attached):
[1] compiler_4.1.1  Matrix_1.3-4    tools_4.1.1     grid_4.1.1 lattice_0.20-44
>
> customers <- data.frame(
      id=c(10, 20, 30, 40, 50),
      gender=c('male', 'female', 'female', 'male', 'female'),
      mood=c('happy', 'sad', 'happy', 'sad','happy'),
      outcome=c(1, 1, 0, 0, 0))
>
> customers
  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0
>
> library(data.table)
> library(mltools)
>
> customers_1h <- one_hot(as.data.table(customers))
>
> customers_1h
   id gender  mood outcome
1: 10   male happy       1
2: 20 female   sad       1
3: 30 female happy       0
4: 40   male   sad       0
5: 50 female happy       0 

while this is what I'd expect to happen (one-hot encoding, as seen on Linux):

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Leap 15.3

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8       
 [4] LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0   
> 
> customers <- data.frame(
      id=c(10, 20, 30, 40, 50),
      gender=c('male', 'female', 'female', 'male', 'female'),
      mood=c('happy', 'sad', 'happy', 'sad','happy'),
      outcome=c(1, 1, 0, 0, 0))
> 
> customers
  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0
> 
> library(data.table)
data.table 1.14.2 using 8 threads (see ?getDTthreads).  Latest news: r-datatable.com
> library(mltools)
> 
> customers_1h <- one_hot(as.data.table(customers))
> 
> customers_1h
   id gender_female gender_male mood_happy mood_sad outcome
1: 10             0           1          1        0       1
2: 20             1           0          0        1       1
3: 30             1           0          1        0       0
4: 40             0           1          0        1       0
5: 50             1           0          1        0       0

At least the same packages seem to be installed. So why does one-hot encoding silently not take place, without so much as an error? Can anyone give me a hint on how to get Windows to behave?

Many thanks in advance

Chris

CodePudding user response:

I think this has to do with your R versions, not the platform: your Windows machine runs R 4.1.1, while your Linux machine runs R 3.5.0. One of the key defaults for creating data frames, stringsAsFactors, changed to FALSE in R 4.0 after years of tripping up unsuspecting new users. However, some packages (mltools, it seems, among them) expect the kind of data frame that the old default, stringsAsFactors = TRUE, would create, where string columns are stored as factors. For more: https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html
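To make the difference concrete, here is a small sketch in base R that sets stringsAsFactors explicitly, so it behaves the same on R 3.x and R 4.x. The variable names chars and factors are just for illustration:

```r
# Build the same column twice, once per setting.
chars   <- data.frame(gender = c("male", "female"), stringsAsFactors = FALSE)
factors <- data.frame(gender = c("male", "female"), stringsAsFactors = TRUE)

class(chars$gender)     # "character" (what R >= 4.0 gives by default)
class(factors$gender)   # "factor"    (what R < 4.0 gave by default)
levels(factors$gender)  # "female" "male"
```

Only the factor version carries a fixed set of levels, which is presumably what a one-hot encoder looks for when deciding which columns to expand.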

I was able to replicate the problem and could fix it by setting stringsAsFactors = TRUE. (BTW, it looks like mltools::one_hot expects a data.table as input, so I'm not sure there's a way to avoid using data.table.)

Doesn't work:

customers <- data.frame(
  id=c(10, 20, 30, 40, 50),
  gender=c('male', 'female', 'female', 'male', 'female'),
  mood=c('happy', 'sad', 'happy', 'sad','happy'),
  outcome=c(1, 1, 0, 0, 0))

mltools::one_hot(data.table::as.data.table(customers))

   id gender  mood outcome
1: 10   male happy       1
2: 20 female   sad       1
3: 30 female happy       0
4: 40   male   sad       0
5: 50 female happy       0

Works:

customers <- data.frame(
  id=c(10, 20, 30, 40, 50),
  gender=c('male', 'female', 'female', 'male', 'female'),
  mood=c('happy', 'sad', 'happy', 'sad','happy'),
  outcome=c(1, 1, 0, 0, 0),
  stringsAsFactors = TRUE)

mltools::one_hot(data.table::as.data.table(customers))


   id gender_female gender_male mood_happy mood_sad outcome
1: 10             0           1          1        0       1
2: 20             1           0          0        1       1
3: 30             1           0          1        0       0
4: 40             0           1          0        1       0
5: 50             1           0          1        0       0
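If the data frame already exists (for example, because it was read from a file) and you can't change how it was created, an alternative is to convert the character columns to factors afterwards. This is only a sketch under that assumption, not something from the mltools documentation. The helper variable is_chr is made up for illustration, and the one_hot call is guarded so the snippet also runs where the packages are not installed:

```r
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c("male", "female", "female", "male", "female"),
  mood = c("happy", "sad", "happy", "sad", "happy"),
  outcome = c(1, 1, 0, 0, 0),
  stringsAsFactors = FALSE  # mimic the R >= 4.0 default explicitly
)

# Find the character columns and convert only those to factors.
is_chr <- vapply(customers, is.character, logical(1))
customers[is_chr] <- lapply(customers[is_chr], factor)

str(customers)

# Run the encoding only if both packages are available.
if (requireNamespace("data.table", quietly = TRUE) &&
    requireNamespace("mltools", quietly = TRUE)) {
  customers_1h <- mltools::one_hot(data.table::as.data.table(customers))
  print(customers_1h)
}
```

The numeric columns id and outcome are left untouched, so only gender and mood become candidates for encoding.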