How to remove hidden characters in R from string imported from Excel?-CodePudding

Using the openxlsx package in R, I am importing data from an Excel file that originated in Brazil. In the character strings, there seem to be hidden characters. As you can see from my code, I remove the white space but strings 2 and 3 are still shown as unique strings. Strings 6 and 7 are also appearing as unique strings when they should be identical.

moths %<>%
  mutate(Details = str_trim(Details))
sort(unique(moths$Details))

[1] "Check Outside" "Check Plot"
[3] "Check Plot " "Check Plot  (between treatments)"
[5] "PRX-01GA1-21022.00" "PRX-01GA2-22001"
[7] "PRX-01GA2-22001" "PRX-01GA2-22002"
[9] "PRX-01GA2-22002" "PRX-01GA2-22003"
[11] "PRX-01GA2-22004" "SF2.5VP"
[13] "YM001-22" "YM001-PRX-01GA2-22001 PRX-01GA2-22001"

Unfortunately, since I can't attach the Excel file that the data are coming from, I can't make a completely reproducible example here, but hopefully someone can still provide some insight.

CodePudding user response：

you haven't saved your cleaned data

moths %<>%
  mutate(Details = str_trim(Details)) -> moths

sort(unique(moths$Details))

CodePudding user response：

There may be some non-ascii characters in your data. If you're happy to remove them, you can use textclean, like so (this example uses the first 4 values of your data):

vec <- c("Check Outside", "Check Plot", "Check Plot ", 
         "Check Plot  (between treatments)")
unique(vec) 
# [1] "Check Outside"        "Check Plot"  
# [3] "Check Plot "          "Check Plot  (between treatments)"


library(textclean)
vec2 <- replace_non_ascii(vec)
unique(vec2)
# [1] "Check Outside"    "Check Plot"  "Check Plot (between treatments)"

So tl;dr this should do what you’re after


library(textclean)

moths <- moths %>%
  mutate(Details = replace_non_ascii(str_trim(Details)))