replace double by single backslash

Time:06-13

I need to find a way to replace "\\" by "\" in a string using R. To be more specific, I have text data that is encoded as follows:

text <- c("K\\xc3\\xb6nnen", "S\\xc3\\xbcd")

I want to convert this to UTF-8, which would give the following result:

c("Können", "Süd")

However, the data above has one too many backslashes to convert it, i.e. I need to change the text-vector to:

text_correct <- c("K\xc3\xb6nnen", "S\xc3\xbcd")

Which would make it very easy to encode the data:

library(utf8)
as_utf8(text_correct)

I have already googled a lot, but could not find a way to replace "\\" by "\" using gsub or similar commands. I'm grateful for any help.

CodePudding user response:

Despite appearances, there are no double backslashes in your string. There are single backslashes. When you want a single backslash in a string in R, you need to type out two backslashes, as in your example.

This is because in an R string, a single backslash indicates that you are beginning an escape sequence. An escape sequence makes it possible to enter characters that would otherwise be difficult to handle. For example, if I want a newline character, my string would be "\n". This is not stored internally as a backslash and an "n", but rather as the ASCII character 0x0a, i.e. a newline character. The R parser 'sees' the sequence \n and reads it as meaning "I want a newline character here".

The reason for having backslash escapes is that we need a way to differentiate between, say, wanting a newline character and wanting a literal backslash followed by the character 'n'. In the latter case, our R string would be "\\n", and would be stored as two ASCII bytes: one for a backslash and one for a lower case 'n'.
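
A quick way to check this at the console is to count characters and look at the underlying bytes. This is a minimal sketch using only base R (nchar, charToRaw, print, writeLines):

```r
# "\n" is one character: the newline byte 0x0a
nchar("\n")        # 1
charToRaw("\n")    # 0a

# "\\n" is two characters: a backslash (0x5c) followed by "n" (0x6e)
nchar("\\n")       # 2
charToRaw("\\n")   # 5c 6e

# print() shows the escaped form, writeLines() shows the actual content
print("\\n")       # [1] "\\n"
writeLines("\\n")  # \n
```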

You cannot use gsub to replace double backslashes with single ones, since there are no double backslashes in the string, and the result you want doesn't contain any backslashes at all. Although the sequence \xc3 in R source code looks like it has a backslash, it doesn't. It is just your way of telling R that you want the single byte 0xc3 in your string (0xc3 lies outside the ASCII range, so it is a raw byte rather than an ASCII character).
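
The same kind of check can be applied to the data from the question. The sketch below uses only base R; the variable names has_single and has_double are introduced here purely for illustration:

```r
text <- c("K\\xc3\\xb6nnen", "S\\xc3\\xbcd")

# What the strings actually contain: one backslash before each "x"
writeLines(text)
# K\xc3\xb6nnen
# S\xc3\xbcd

# A single literal backslash is present, two in a row are not
has_single <- grepl("\\", text, fixed = TRUE)    # TRUE TRUE
has_double <- grepl("\\\\", text, fixed = TRUE)  # FALSE FALSE

# The escaped text spells out "\xc3" as four separate ASCII bytes
charToRaw(text[1])
# 4b 5c 78 63 33 5c 78 62 36 6e 6e 65 6e
```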

Essentially your input string has been 'double escaped', and to convert those \\xc3 entries to the bytes they are supposed to represent, you need to unescape them.

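Even then, simply unescaping is not enough: stringi::stri_unescape_unicode reads \xc3 as the Unicode character U+00C3 rather than as the raw byte 0xc3, so the result is not yet a bytewise representation of the correct UTF-8 characters. You therefore need to unescape the string, convert it to the native encoding, and then reinterpret the resulting bytes as UTF-8. This appears to rely on the native encoding being a single-byte one such as Latin-1: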

text <- c("K\\xc3\\xb6nnen", "S\\xc3\\xbcd")

text <- enc2native(stringi::stri_unescape_unicode(text))
Encoding(text) <- 'UTF-8'
text
#> [1] "Können" "Süd"
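
If the native-encoding round trip is not available on your platform, one alternative is to decode the \xHH sequences at the byte level. The following is only a sketch under stated assumptions: unescape_hex_bytes is a hypothetical helper (not part of base R or stringi), it handles only two-digit \xHH escapes, it does not support \x00, and it does not check that the resulting bytes are valid UTF-8.

```r
# Hypothetical helper: turn each literal "\xHH" sequence into the single
# byte it names, copy every other byte unchanged, and mark the result UTF-8.
unescape_hex_bytes <- function(x) {
  hex_digits <- charToRaw("0123456789abcdefABCDEF")
  decode_one <- function(s) {
    bytes <- charToRaw(s)
    n <- length(bytes)
    out <- raw(0)
    i <- 1L
    while (i <= n) {
      # Look for backslash (0x5c), "x" (0x78), then two hex digits
      if (i + 3L <= n &&
          bytes[i] == as.raw(0x5c) && bytes[i + 1L] == as.raw(0x78) &&
          all(bytes[(i + 2L):(i + 3L)] %in% hex_digits)) {
        hex <- rawToChar(bytes[(i + 2L):(i + 3L)])
        out <- c(out, as.raw(strtoi(hex, base = 16L)))
        i <- i + 4L
      } else {
        out <- c(out, bytes[i])
        i <- i + 1L
      }
    }
    res <- rawToChar(out)
    Encoding(res) <- "UTF-8"  # assumption: the escaped bytes were UTF-8
    res
  }
  vapply(x, decode_one, character(1), USE.NAMES = FALSE)
}

text <- c("K\\xc3\\xb6nnen", "S\\xc3\\xbcd")
decoded <- unescape_hex_bytes(text)
decoded
#> [1] "Können" "Süd"
```

Because the helper works on raw bytes, it does not depend on the platform's native encoding.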

CodePudding user response:

The approach below works for me on Windows in R 4.2.

If the string had been written with single backslashes, it would have worked directly:

c("K\xc3\xb6nnen", "S\xc3\xbcd")
## [1] "Können" "Süd"   

But to the parser, a double backslash within a character string is a single backslash, so just wrap each string in quotes, parse it, and convert the result to character. No packages are used.

text <- c("K\\xc3\\xb6nnen", "S\\xc3\\xbcd")

as.character(str2expression(sprintf('"%s"', text)))
## [1] "Können" "Süd"   

It can alternately be written as a pipeline.

text |>
  sprintf(fmt = '"%s"') |>
  str2expression() |>
  as.character()

In R 4.1, to get it to work, additionally mark the result's encoding as UTF-8.

result <- as.character(str2expression(sprintf('"%s"', text)))
Encoding(result) <- "UTF-8"
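
To confirm that the parsed strings really hold the intended bytes, regardless of how the console displays them, a check along the following lines may help. It is a sketch using only base R (validUTF8, charToRaw, nchar) and repeats the setup so that it runs on its own:

```r
text <- c("K\\xc3\\xb6nnen", "S\\xc3\\xbcd")
result <- as.character(str2expression(sprintf('"%s"', text)))
Encoding(result) <- "UTF-8"

# The decoded bytes should form valid UTF-8
validUTF8(result)      # TRUE TRUE

# "ö" should be stored as the two bytes c3 b6
charToRaw(result[1])   # 4b c3 b6 6e 6e 65 6e

# Each umlaut now counts as one character
nchar(result)          # 6 3
```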

CodePudding user response:

I think the solution proposed by @allan-cameron should work for Windows users. For Mac users, I did not find a better or less brutal solution than this:

(1) Copy the table from https://www.i18nqa.com/debug/utf8-debug.html and keep the columns "expected" and "actual".

(2) Sort the table by the number of characters in "actual", longest string first, so that longer miscoded sequences are replaced before any shorter ones they contain, and save it as conversion.csv.

(3) Run the following code:

# Read conversion table (comma-separated, columns "expected" and "actual"):
conversion <- read.csv("conversion.csv")

# Run code suggested above    
text <- c("K\\xc3\\xb6nnen", "S\\xc3\\xbcd")
text <- enc2native(stringi::stri_unescape_unicode(text))
# this gives: "KÁ¶nnen" "SÁ¼d"   

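# Next, loop over the conversion table and replace each miscoding literally
# (fixed = TRUE treats the pattern as plain text, not a regular expression):
for (i in seq_len(nrow(conversion))) {
  text <- gsub(conversion$actual[i], conversion$expected[i], text, fixed = TRUE)
}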
text
# this returns: "Können" "Süd"   