I received a file with a strange encoding and wondered if there's any way to check for 'corrupted' strings. For example:
dat <- c("天脊煤化工集团股份有é\231\220å…¬å\217¸", "AB \"\"Achema\"\"",
"Abu Qir Fertilizers & Chemical", "Abu Zaabal Fertilizer &",
"ADP - Adubos De Portugal SA")
The first and second elements of the vector above are corrupted, since they contain mojibake and escape characters. How can I filter these out, or generate an index of the corrupted strings in the vector dat?
CodePudding user response:
# Flag strings that cannot be converted to ASCII (iconv() returns NA for them)
# or that contain a literal backslash or double quote
error_string_idx <- which(
  is.na(iconv(dat, to = "ascii")) | grepl('\\\\|\\"', dat)
)
CodePudding user response:
Try this:
gsub("[^a-zA-Z]", "", dat)
If you don't want empty strings in the result, use
Filter(function(x) nchar(x) > 0, gsub("[^a-zA-Z]", "", dat))
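A quick sketch of what this approach does, shown on a small ASCII-only vector to sidestep the encoding issues in the original data (the sample strings here are illustrative):

```r
x <- c("AB \"\"Achema\"\"", "123 - 456", "ADP - Adubos De Portugal SA")

# Strip everything except ASCII letters
cleaned <- gsub("[^a-zA-Z]", "", x)
cleaned
# → "ABAchema" "" "ADPAdubosDePortugalSA"

# Drop elements that became empty after cleaning
Filter(function(s) nchar(s) > 0, cleaned)
# → "ABAchema" "ADPAdubosDePortugalSA"
```

Note this transforms every string, including legitimate ones, so it answers a slightly different question than indexing the corrupted entries.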