Using the openxlsx package in R, I am importing data from an Excel file that originated in Brazil. In the character strings, there seem to be hidden characters. As you can see from my code, I remove the white space but strings 2 and 3 are still shown as unique strings. Strings 6 and 7 are also appearing as unique strings when they should be identical.
moths %<>%
mutate(Details = str_trim(Details))
sort(unique(moths$Details))
[1] "Check Outside" "Check Plot"
[3] "Check Plot " "Check Plot (between treatments)"
[5] "PRX-01GA1-21022.00" "PRX-01GA2-22001"
[7] "PRX-01GA2-22001" "PRX-01GA2-22002"
[9] "PRX-01GA2-22002" "PRX-01GA2-22003"
[11] "PRX-01GA2-22004" "SF2.5VP"
[13] "YM001-22" "YM001-PRX-01GA2-22001 PRX-01GA2-22001"
Unfortunately, since I can't attach the Excel file that the data are coming from, I can't make a completely reproducible example here, but hopefully someone can still provide some insight.
CodePudding user response:
you haven't saved your cleaned data
moths %<>%
mutate(Details = str_trim(Details)) -> moths
sort(unique(moths$Details))
CodePudding user response:
There may be some non-ascii characters in your data. If you're happy to remove them, you can use textclean, like so (this example uses the first 4 values of your data):
vec <- c("Check Outside", "Check Plot", "Check Plot ",
"Check Plot (between treatments)")
unique(vec)
# [1] "Check Outside" "Check Plot"
# [3] "Check Plot " "Check Plot (between treatments)"
library(textclean)
vec2 <- replace_non_ascii(vec)
unique(vec2)
# [1] "Check Outside" "Check Plot" "Check Plot (between treatments)"
So tl;dr this should do what you’re after
library(textclean)
moths <- moths %>%
mutate(Details = replace_non_ascii(str_trim(Details)))