How to remove these special characters in R from a set of strings: ’s, …

Time:12-10

I have strings that contain special characters. I cannot remove these characters from the main data frame. However, when I put one string in a separate object, dft, and ran the following code, the special characters were removed.

dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"

rmSpec <- "â|€|¦|â|€™|" # The "|" is a logical OR in regular expressions.

s.rem <- gsub(rmSpec, "", dft) # gsub() replaces every match of rmSpec with "".
s.rem

But when I run the same code on the main data frame, which holds the tweets as separate lines (shown below), it does not work and raises this error: Error in UseMethod("inspect", x) : no applicable method for 'inspect' applied to an object of class "character"

[1] rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar…
[2] rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy
[3] rt bitshiba sending shib follow retweet tweet uufefufcd
[4] rt shibinform want shib get listed robinhoodappuf yes yes yes ubufef ubufef ubufef
[5] rt shiblucky shib giveaway just retweet follow

Please help with this. Thanks.

CodePudding user response:

To keep only letters and digits, we can replace every other character with a space:

library(stringr)
    
dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"

str_replace_all(dft, "[^a-zA-Z0-9]", " ")
[1] "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar    rt askthedr just bought m usd worth shib think it   s robinhoodapp shibaarmy"
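The question asks about a whole set of tweets rather than one string. Note that the error message refers to inspect(), not to gsub(), and gsub() itself is vectorised, so it can be applied to an entire column. The following is a minimal sketch in base R, assuming the tweets live in a data frame column; the names `tweets`, `text` and `text_clean` are hypothetical, and the garbled characters are written as `\u` escapes so the script does not depend on the file's encoding.

```r
# Hypothetical data frame; the column name `text` is an assumption.
tweets <- data.frame(
  text = c(
    "shibainu shibar\u00e2\u20ac\u00a6",        # garbled ellipsis at the end
    "think it\u00e2\u20ac\u2122s robinhoodapp", # garbled apostrophe
    "rt shiblucky shib giveaway"                # already clean
  ),
  stringsAsFactors = FALSE
)

# gsub() processes every element of the column.
# The "+" collapses each run of unwanted characters into a single space.
cleaned <- gsub("[^a-zA-Z0-9]+", " ", tweets$text)

# Remove the leading/trailing spaces the replacement may leave behind.
tweets$text_clean <- trimws(cleaned)

print(tweets$text_clean)
```

With this input, the cleaned column holds "shibainu shibar", "think it s robinhoodapp" and "rt shiblucky shib giveaway". As in the answer above, the apostrophe is lost rather than restored.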

CodePudding user response:

As said, the preferred way is to read in your data with the correct encoding, but sometimes the data is simply corrupted. My FixEncoding function builds a named lookup vector to repair this. It covers up to three rounds of wrong encoding, which I have run into in the past with old CSV files that were stored incorrectly. It takes a set of Unicode characters, mis-encodes each one deliberately, and uses the results as names, so the errors can be translated back.

FixEncoding <- function() {
  # create the unicode ranges from https://www.i18nqa.com/debug/utf8-debug.html
  range <- c(sprintf("%x", seq(strtoi("0xa0"), strtoi("0xff"))))
  unicode <- vapply(range, FUN.VALUE = character(1), function(x) { parse(text = paste0("'\\u00", x, "'"))[[1]] })
  # add the ones that are missing (red ones in https://www.i18nqa.com/debug/utf8-debug.html)
  unicode <- c(c("\u0168", "\u0152", "\u017d", "\u0153", "\u017e", "\u0178", "\u2019", "\u20ac", "\u201a", "\u0192", "\u201e", "\u2026", "\u2020", "\u2021", "\u02c6", "\u2030", "\u0160", "\u2030"), unicode)
  once <- vapply(unicode, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_once <- unicode
  names(fix_once) <- once
  twice <- vapply(once, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_twice <- unicode
  names(fix_twice) <- twice
  triple <- vapply(twice, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_triple <- unicode
  names(fix_triple) <- triple
  fixes <- c(fix_triple, fix_twice, fix_once)
  return(fixes)
}

fixes <- FixEncoding()

str(fixes)

 Named chr [1:342] "U" "Œ" "Ž" "œ" "ž" "Ÿ" "’" "\200" "‚" "ƒ" "„" "…" "†" "‡" "\210" "‰" "Š" "‰" " " "¡" "¢" "£" "¤" "¥" "¦" "§" "¨" "©" "ª" "«" "¬" "­" "®" "¯" "°" "±" "²" "³" "´" ...
 - attr(*, "names")= chr [1:342] "Ãâ\200¦Ã‚¨" "Ãâ\200¦Ã¢â‚¬â„¢" "Ãâ\200¦Ã‚½" "Ãâ\200¦Ã¢â‚¬Å“" ...

dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"

library(stringi)  # provides stri_replace_all_fixed()
stri_replace_all_fixed(dft, names(fixes), fixes, vectorize_all = FALSE)

[1] "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"

Simple example of what happens

Let's take lélèlölã, which in Unicode escapes is "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"

messy <- "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"

messy
# [1] "lélèlölã"

# This mimics what happens when a file is saved with the wrong encoding.
# FixEncoding() above does the same thing, one Unicode character at a time.
# Running these two lines again mis-encodes the string a second and a third time.
Encoding(messy) <- "Windows-1252"
messy <- iconv(messy, to = "UTF-8")

messy
# [1] "lÃ©lÃ¨lÃ¶lÃ£" # once badly encoded
# [1] "lÃƒÂ©lÃƒÂ¨lÃƒÂ¶lÃƒÂ£" # twice badly encoded
# [1] "lÃƒÆ’Ã‚Â©lÃƒÆ’Ã‚Â¨lÃƒÆ’Ã‚Â¶lÃƒÆ’Ã‚Â£" # three times!

# All three strings above would be fixed
stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)
# [1] "lélèlölã"

# Simpler replacements, as suggested elsewhere, would give `llll` or `lÃlÃlÃlÃ` instead.
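When a string has been mis-encoded in exactly this way (UTF-8 bytes read as Windows-1252), one round can often be undone without a lookup table by converting back to the original bytes and declaring them UTF-8. The sketch below is an alternative to the lookup approach, not the author's method. The function name `undo_mojibake_once` is hypothetical, and the encoding name "CP1252" is assumed to be available in the local iconv implementation. It returns NA when the input is not reversible this way, for example when the text was already clean.

```r
# Hypothetical helper: undo ONE round of "UTF-8 bytes read as CP1252".
undo_mojibake_once <- function(x) {
  # Convert the characters back to the CP1252 bytes they were built from.
  bytes <- iconv(x, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]
  if (is.null(bytes)) return(NA_character_)   # not representable in CP1252
  out <- rawToChar(bytes)
  if (!validUTF8(out)) return(NA_character_)  # bytes are not valid UTF-8
  Encoding(out) <- "UTF-8"                    # declare the bytes as UTF-8
  out
}

clean <- "l\u00e9l\u00e8l\u00f6l\u00e3"      # "lélèlölã"

# Mis-encode once, naming the source encoding explicitly so the result
# does not depend on the session's native encoding.
messy <- iconv(clean, from = "CP1252", to = "UTF-8")

fixed <- undo_mojibake_once(messy)
print(identical(fixed, clean))

# The garbled ellipsis from the question, written with \u escapes.
tweet <- undo_mojibake_once("shibar\u00e2\u20ac\u00a6")
print(identical(tweet, "shibar\u2026"))
```

Text that went through several rounds would need the helper applied repeatedly. Windows-1252 also leaves a few byte values undefined, so some damaged characters cannot be recovered this way; the lookup vector remains useful for those cases.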