I have a string that contains special characters. I am not able to remove these characters from the main data frame; however, when I prepared a separate object dft and used the following code on it, I was able to remove the special characters.
dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"
rmSpec <- "â|€|¦|â|€™|" # The "|" designates a logical OR in regular expressions.
s.rem <- gsub(rmSpec, "", dft) # gsub replaces any match of rmSpec with "".
s.rem
But when I use the same code on the main data frame, which contains the tweets shown below as separate lines, it doesn't work and I get the error: Error in UseMethod("inspect", x) : no applicable method for 'inspect' applied to an object of class "character"
[1] rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar…
[2] rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy
[3] rt bitshiba sending shib follow retweet tweet uufefufcd
[4] rt shibinform want shib get listed robinhoodappuf yes yes yes ubufef ubufef ubufef
[5] rt shiblucky shib giveaway just retweet follow
Could you please help with this? Thanks.
CodePudding user response:
To keep only letters and numbers, we may use:
library(stringr)
dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"
str_replace_all(dft, "[^a-zA-Z0-9]", " ")
[1] "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar rt askthedr just bought m usd worth shib think it s robinhoodapp shibaarmy"
CodePudding user response:
As said, the preferred way is to read your data in with the correct encoding, but sometimes the data is simply corrupt. My FixEncoding function builds a named lookup vector to fix this, covering up to three rounds of wrong encoding, which I have run into in the past with old CSV files that were stored incorrectly. It takes the relevant Unicode characters, deliberately mis-encodes them, and uses the result as a lookup so the errors can be translated back. (A read-in sketch is at the end of this answer.)
FixEncoding <- function() {
  # create the unicode ranges from https://www.i18nqa.com/debug/utf8-debug.html
  range <- c(sprintf("%x", seq(strtoi("0xa0"), strtoi("0xff"))))
  unicode <- vapply(range, FUN.VALUE = character(1), function(x) {
    parse(text = paste0("'\\u00", x, "'"))[[1]]
  })
  # add the ones that are missing (red ones in https://www.i18nqa.com/debug/utf8-debug.html)
  unicode <- c(c("\u0168", "\u0152", "\u017d", "\u0153", "\u017e", "\u0178", "\u2019",
                 "\u20ac", "\u201a", "\u0192", "\u201e", "\u2026", "\u2020", "\u2021",
                 "\u02c6", "\u2030", "\u0160", "\u2030"), unicode)
  # simulate one round of mis-decoding: treat the UTF-8 bytes as Windows-1252
  once <- vapply(unicode, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_once <- unicode
  names(fix_once) <- once
  # second round of mis-decoding
  twice <- vapply(once, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_twice <- unicode
  names(fix_twice) <- twice
  # third round of mis-decoding
  triple <- vapply(twice, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_triple <- unicode
  names(fix_triple) <- triple
  # lookup vector: names are the mangled forms, values are the original characters
  fixes <- c(fix_triple, fix_twice, fix_once)
  return(fixes)
}
fixes <- FixEncoding()
str(fixes)
Named chr [1:342] "U" "Œ" "Ž" "œ" "ž" "Ÿ" "’" "\200" "‚" "ƒ" "„" "…" "†" "‡" "\210" "‰" "Š" "‰" " " "¡" "¢" "£" "¤" "¥" "¦" "§" "¨" "©" "ª" "«" "¬" "" "®" "¯" "°" "±" "²" "³" "´" ...
- attr(*, "names")= chr [1:342] "Ãâ\200¦Ã‚¨" "Ãâ\200¦Ã¢â‚¬â„¢" "Ãâ\200¦Ã‚½" "Ãâ\200¦Ã¢â‚¬Å“" ...
dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"
library(stringi) # stri_replace_all_fixed() comes from stringi
stri_replace_all_fixed(dft, names(fixes), fixes, vectorize_all = FALSE)
[1] "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"
A simple example of what happens:
let's take lélèlölã,
which in Unicode escapes is "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3".
messy <- "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"
messy
# [1] "lélèlölã"
# this is what happens when a file gets corrupted by being saved in the wrong encoding;
# the fix function above does exactly this, one Unicode character at a time
Encoding(messy) <- "Windows-1252"
messy <- iconv(messy, to = "UTF-8")
messy
# [1] "lélèlölã" # once badly encoded
# [1] "lélèlölã" # twice badly encoded
# [1] "lélèlölã" # three times!
# All three strings above would be fixed
stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)
# [1] "lélèlölã"
# other simple replacements, as suggested above, would instead leave us with `llll` or `lÃlÃlÃlÃ`
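Finally, as mentioned at the start, the preferred route is to avoid the corruption altogether by declaring the correct encoding when the data is read in. A minimal sketch, assuming a hypothetical tweets.csv (or tweets.txt) that was saved as UTF-8:
# declare the file encoding up front, so no repair is needed afterwards
tweets <- read.csv("tweets.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)
# or, for a plain text file with one tweet per line
lines <- readLines("tweets.txt", encoding = "UTF-8")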