I have a string that contains special characters. I am not able to remove these characters from the main data frame; however, when I prepared a separate object dft and used the following code on it, I was able to remove the special characters.
dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"
rmSpec <- "â|€|¦|â|€™|" # The "|" designates a logical OR in regular expressions.
s.rem <- gsub(rmSpec, "", dft) # gsub replaces any match of rmSpec with "".
s.rem
But when I use the same code on the main data frame, which contains the tweets shown below as separate lines, it doesn't work and I get the error: Error in UseMethod("inspect", x) : no applicable method for 'inspect' applied to an object of class "character"
[1] rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar…
[2] rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy
[3] rt bitshiba sending shib follow retweet tweet uufefufcd
[4] rt shibinform want shib get listed robinhoodappuf yes yes yes ubufef ubufef ubufef
[5] rt shiblucky shib giveaway just retweet follow
Could you please help with this? Thanks.
CodePudding user response:
To keep only letters and numbers, we may use:
library(stringr)
dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"
str_replace_all(dft, "[^a-zA-Z0-9]", " ")
[1] "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar rt askthedr just bought m usd worth shib think it s robinhoodapp shibaarmy"
CodePudding user response:
As said, the preferred way is to read your data in with the correct encoding, but sometimes the data is simply corrupt. My FixEncoding function builds a named lookup vector to fix this, covering up to three rounds of wrong encoding, which I have run into in the past with old CSV files that were stored incorrectly. It takes the relevant Unicode characters, deliberately mis-encodes them, and uses the result as a lookup so the errors can be translated back. (A read-in sketch is at the end of this answer.)
FixEncoding <- function() {
  # create the unicode ranges from https://www.i18nqa.com/debug/utf8-debug.html
  range <- c(sprintf("%x", seq(strtoi("0xa0"), strtoi("0xff"))))
  unicode <- vapply(range, FUN.VALUE = character(1), function(x) {
    parse(text = paste0("'\\u00", x, "'"))[[1]]
  })
  # add the ones that are missing (red ones in https://www.i18nqa.com/debug/utf8-debug.html)
  unicode <- c(c("\u0168", "\u0152", "\u017d", "\u0153", "\u017e", "\u0178", "\u2019",
                 "\u20ac", "\u201a", "\u0192", "\u201e", "\u2026", "\u2020", "\u2021",
                 "\u02c6", "\u2030", "\u0160", "\u2030"), unicode)
  # simulate one round of mis-decoding: treat the UTF-8 bytes as Windows-1252
  once <- vapply(unicode, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_once <- unicode
  names(fix_once) <- once
  # second round of mis-decoding
  twice <- vapply(once, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_twice <- unicode
  names(fix_twice) <- twice
  # third round of mis-decoding
  triple <- vapply(twice, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_triple <- unicode
  names(fix_triple) <- triple
  # lookup vector: names are the mangled forms, values are the original characters
  fixes <- c(fix_triple, fix_twice, fix_once)
  return(fixes)
}
fixes <- FixEncoding()
str(fixes)
Named chr [1:342] "U" "Œ" "Ž" "œ" "ž" "Ÿ" "’" "\200" "‚" "ƒ" "„" "…" "†" "‡" "\210" "‰" "Š" "‰" " " "¡" "¢" "£" "¤" "¥" "¦" "§" "¨" "©" "ª" "«" "¬" "" "®" "¯" "°" "±" "²" "³" "´" ...
- attr(*, "names")= chr [1:342] "Ãâ\200¦Ã‚¨" "Ãâ\200¦Ã¢â‚¬â„¢" "Ãâ\200¦Ã‚½" "Ãâ\200¦Ã¢â‚¬Å“" ...
dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"
library(stringi) # stri_replace_all_fixed() comes from stringi
stri_replace_all_fixed(dft, names(fixes), fixes, vectorize_all = FALSE)
[1] "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"
A simple example of what happens:
let's take lélèlölã,
which in Unicode escapes is "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3".
messy <- "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"
messy
# [1] "lélèlölã"
# this is what happens when a file gets corrupted by being saved in the wrong encoding;
# the fix function above does exactly this, one Unicode character at a time
Encoding(messy) <- "Windows-1252"
messy <- iconv(messy, to = "UTF-8")
messy
# [1] "lélèlölã" # once badly encoded
# [1] "lélèlölã" # twice badly encoded
# [1] "lélèlölã" # three times!
# All three strings above would be fixed
stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)
# [1] "lélèlölã"
# other simple replacements, as suggested above, would instead leave us with `llll` or `lÃlÃlÃlÃ`
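Finally, as mentioned at the start, the preferred route is to avoid the corruption altogether by declaring the correct encoding when the data is read in. A minimal sketch, assuming a hypothetical tweets.csv (or tweets.txt) that was saved as UTF-8:
# declare the file encoding up front, so no repair is needed afterwards
tweets <- read.csv("tweets.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)
# or, for a plain text file with one tweet per line
lines <- readLines("tweets.txt", encoding = "UTF-8")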