Decoding strings including utf8-literals like '\xc3\xa6' in Swift?

Time:11-23

Follow-up question to my earlier thread about UTF-8 literals:

It was established that you can decode UTF-8 literals from a string that consists exclusively of UTF-8 literals, like this:

let s = "\\xc3\\xa6"
let bytes = s
    .components(separatedBy: "\\x")
    // components(separatedBy:) would produce an empty string as the first element
    // because the string starts with "\x". We drop this
    .dropFirst() 
    .compactMap { UInt8($0, radix: 16) }
if let decoded = String(bytes: bytes, encoding: .utf8) {
    print(decoded)
} else {
    print("The UTF8 sequence was invalid!")
}

However, this only works if the string contains nothing but UTF-8 literals. I am fetching a list of Wi-Fi names that have these UTF-8 literals embedded in ordinary text, so how do I go about decoding the entire string?
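
To illustrate the limitation, here is a minimal sketch that applies the same steps to a mixed string. The names `mixed`, `mixedBytes` and `mixedResult` are only placeholders for this illustration. The leading text is dropped by `dropFirst()`, and the piece "a6 end" is not a valid hex number, so only one byte survives and the decoding fails:

```swift
import Foundation

let mixed = "Name \\xc3\\xa6 end"
// Splitting yields ["Name ", "c3", "a6 end"]
let mixedBytes = mixed
    .components(separatedBy: "\\x")
    // Drops "Name ", which is real text here, not an empty string
    .dropFirst()
    // "a6 end" cannot be parsed as hex, so it is silently discarded
    .compactMap { UInt8($0, radix: 16) }
// A lone 0xC3 is an incomplete UTF-8 sequence, so this is nil
let mixedResult = String(bytes: mixedBytes, encoding: .utf8)
```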

Example:

let s = "This is a WiFi Name \\xc3\\xa6 including UTF-8 literals \\xc3\\xb8"

With the expected result:

print(s)
> This is a WiFi Name æ including UTF-8 literals ø

In Python there is a simple solution to this:

contents = source_file.read()
uni = contents.decode('unicode-escape')
enc = uni.encode('latin1')
dec = enc.decode('utf-8')

Is there a similar way to decode these strings in Swift 5?

CodePudding user response:

As far as I know, there's no native Swift solution to this. To make it look as compact as the Python version at the call site, you can build an extension on String to hide the complexity:

extension String {
   func replacingUtf8Literals() -> Self {

      // One or more consecutive "\xNN" escapes, where NN is two hex digits
      let regex = #"(\\x[a-fA-F0-9]{2})+"#
      
      var str = self
      
      while let range = str.range(of: regex, options: .regularExpression) {
         let literalbytes = str[range]
            .components(separatedBy: "\\x")
            .dropFirst()
            .compactMap{UInt8($0, radix: 16)}
         guard let actuals = String(bytes: literalbytes, encoding: .utf8) else {
            fatalError("Regex error")
         }
         str.replaceSubrange(range, with: actuals)
      }
      return str
   }
}

This lets you call:

print(s.replacingUtf8Literals())
// prints: This is a WiFi Name æ including UTF-8 literals ø

For convenience I'm trapping a failed conversion with fatalError. A failed conversion means the matched bytes are not valid UTF-8, which can happen with malformed input even when the regex is correct, so you may want to handle this more gracefully in production code. There needs to be some form of break or thrown error here, otherwise the unreplaced literal matches again and the loop never terminates.
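
One possible way to avoid both the trap and the endless loop is to walk through the string once and copy the pieces into a new string, leaving any run that does not decode untouched. This is only a sketch under that assumption; the method name `replacingUTF8LiteralsLeniently` is hypothetical and not part of the extension above.

```swift
import Foundation

extension String {
    /// Hypothetical helper: decodes runs of "\xNN" escapes as UTF-8 and
    /// keeps any run that is not valid UTF-8 exactly as it was written.
    func replacingUTF8LiteralsLeniently() -> String {
        // One or more consecutive "\xNN" escapes, NN being two hex digits.
        let pattern = #"(\\x[0-9a-fA-F]{2})+"#
        var result = ""
        var cursor = startIndex

        // Only search to the right of the previous match. Every match is
        // non-empty, so the cursor always advances and the loop terminates.
        while let range = self.range(of: pattern,
                                     options: .regularExpression,
                                     range: cursor..<endIndex) {
            // Copy the ordinary text in front of the match.
            result.append(contentsOf: self[cursor..<range.lowerBound])

            let bytes = self[range]
                .components(separatedBy: "\\x")
                .dropFirst()
                .compactMap { UInt8($0, radix: 16) }

            if let decoded = String(bytes: bytes, encoding: .utf8) {
                result.append(decoded)
            } else {
                // Not valid UTF-8: keep the literal text instead of trapping.
                result.append(contentsOf: self[range])
            }
            cursor = range.upperBound
        }
        // Copy whatever follows the last match.
        result.append(contentsOf: self[cursor...])
        return result
    }
}
```

Because all indices refer to the unmodified original string, nothing is invalidated by the replacements, and malformed input simply passes through unchanged.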

CodePudding user response:

To start with, add the decoding code to a String extension as a computed property (or create a function):

extension String {
    var decodeUTF8: String {
        let bytes = self.components(separatedBy: "\\x")
            .dropFirst()
            .compactMap { UInt8($0, radix: 16) }
        return String(bytes: bytes, encoding: .utf8) ?? self
    }
}

Then use a regular expression and a while loop to replace every run of escaped bytes:

var string = s
// Match one or more consecutive "\xNN" escapes so that a multi-byte character is decoded as a unit
while let range = string.range(of: #"(\\x[a-f0-9]{2})+"#, options: [.regularExpression, .caseInsensitive]) {
    string.replaceSubrange(range, with: String(string[range]).decodeUTF8)
}
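
Note that `decodeUTF8` returns its input unchanged when the bytes are not valid UTF-8, so with malformed input (a lone \xff, for example) a while-loop replacement would keep finding the same match. A regex-free alternative, in the spirit of the Python snippet from the question, is to rebuild the byte sequence in a single pass and decode it once at the end. The following is a rough sketch; the function name `decodingHexEscapes` is made up for this example, and it assumes that replacing invalid bytes with U+FFFD is acceptable.

```swift
/// Hypothetical helper: turns every "\xNN" escape into the raw byte NN,
/// copies all other bytes as they are, then decodes the result as UTF-8.
func decodingHexEscapes(_ input: String) -> String {
    let source = Array(input.utf8)
    var output: [UInt8] = []
    output.reserveCapacity(source.count)

    // Numeric value of an ASCII hex digit, or nil for any other byte.
    func hexValue(_ byte: UInt8) -> UInt8? {
        switch byte {
        case UInt8(ascii: "0")...UInt8(ascii: "9"): return byte - UInt8(ascii: "0")
        case UInt8(ascii: "a")...UInt8(ascii: "f"): return byte - UInt8(ascii: "a") + 10
        case UInt8(ascii: "A")...UInt8(ascii: "F"): return byte - UInt8(ascii: "A") + 10
        default: return nil
        }
    }

    var i = 0
    while i < source.count {
        // An escape is a backslash, an "x", and exactly two hex digits.
        if i + 3 < source.count,
           source[i] == UInt8(ascii: "\\"),
           source[i + 1] == UInt8(ascii: "x"),
           let high = hexValue(source[i + 2]),
           let low = hexValue(source[i + 3]) {
            output.append((high << 4) | low)
            i += 4
        } else {
            // Ordinary byte, or something that only looks like an escape.
            output.append(source[i])
            i += 1
        }
    }
    // Lossy decoding: invalid sequences become U+FFFD instead of failing.
    return String(decoding: output, as: UTF8.self)
}
```

With this approach `decodingHexEscapes(s)` handles escapes of any length in one pass and always terminates, at the cost of silently substituting the replacement character for bytes that cannot be decoded.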