R: strange results when looking at the unique elements of two simple strings

Time:09-16

I am absolutely puzzled by what I see. I read an Excel file, and when I look at the unique values in a column of strings, I do not understand the result.

I can reproduce this in a minimal reprex (see below): why does dd have two unique elements, whereas dd2 has just one?

Any suggestion is appreciated.

dd <- c("Grant", "Grant")


dd2 <- c("Grant", "Grant")

unique(dd)
#> [1] "Grant" "Grant"
length(unique(dd))
#> [1] 2

unique(dd2)
#> [1] "Grant"
length(unique(dd2))
#> [1] 1

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] knitr_1.33        magrittr_2.0.1    rlang_0.4.11      fansi_0.5.0      
#>  [5] stringr_1.4.0     styler_1.5.1      highr_0.9         tools_4.1.1      
#>  [9] xfun_0.25         utf8_1.2.2        withr_2.4.2       htmltools_0.5.1.1
#> [13] ellipsis_0.3.2    yaml_2.2.1        digest_0.6.27     tibble_3.1.3     
#> [17] lifecycle_1.0.0   crayon_1.4.1      purrr_0.3.4       vctrs_0.3.8      
#> [21] fs_1.5.0          glue_1.4.2        evaluate_0.14     rmarkdown_2.10   
#> [25] reprex_2.0.1      stringi_1.7.3     compiler_4.1.1    pillar_1.6.2     
#> [29] backports_1.2.1   pkgconfig_2.0.3

Created on 2021-09-13 by the reprex package (v2.0.1)

Answer:

The raw values of the two elements of dd differ, probably as a result of copying:

sapply(dd, charToRaw)
$`Grant`
[1] ef bb bf 47 72 61 6e 74

$Grant
[1] 47 72 61 6e 74

whereas for dd2, both elements have the same bytes:

sapply(dd2, charToRaw)
     Grant Grant
[1,]    47    47
[2,]    72    72
[3,]    61    61
[4,]    6e    6e
[5,]    74    74

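The first element of dd starts with an extra, invisible character: the bytes ef bb bf are the UTF-8 encoding of U+FEFF, the byte order mark (BOM). It also shows up in the character count: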

nchar(dd)
[1] 6 5
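
To find the affected elements in a longer vector, one option is to test each string for a leading U+FEFF. This is only a sketch: dd is rebuilt here with an explicit "\ufeff" escape, which is an assumption about what the copied text contained (based on the raw bytes shown earlier), and has_bom is an illustrative name.

```r
# Assumed reconstruction of the asker's vector: the first element
# carries a leading U+FEFF, written as an explicit escape so it is visible.
dd <- c("\ufeffGrant", "Grant")

# The first three bytes of the first element are the UTF-8 form of U+FEFF.
charToRaw(dd[1])[1:3]

# Flag every element that starts with that character.
has_bom <- startsWith(dd, "\ufeff")
has_bom
#> [1]  TRUE FALSE
```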

If we remove that first character, unique() returns a single element:

unique(c(substring(dd[1],2), dd[2]))
[1] "Grant"
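
For a whole column rather than a two-element vector, the same idea can be applied element-wise. A minimal sketch, assuming the stray character is always a single leading U+FEFF; dd is again rebuilt with an explicit escape, and the names bom and clean are illustrative.

```r
# Assumed reconstruction of the asker's vector (leading U+FEFF on element 1).
dd <- c("\ufeffGrant", "Grant")

bom <- "\ufeff"

# Drop the first character only from elements that start with the mark;
# leave all other elements untouched.
clean <- ifelse(startsWith(dd, bom), substring(dd, 2), dd)

unique(clean)
#> [1] "Grant"
length(unique(clean))
#> [1] 1
```

Checking with startsWith() first avoids trimming the first letter of strings that were already clean.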