Inverse of `Data.Text.Encoding.decodeLatin1`?

Time:01-19

Is there a function f :: Text -> Maybe ByteString such that forall x:

f (decodeLatin1 x) == Just x

Note, decodeLatin1 has the signature:

decodeLatin1 :: ByteString -> Text

I'm concerned that encodeUtf8 is not what I want: I'm guessing it just writes the text out as UTF-8 bytes, rather than reversing what decodeLatin1 did on the way in to characters in the upper half of the character set (bytes 0x80 to 0xFF).

I understand that f has to return a Maybe, because in general there are Unicode characters that aren't in the Latin-1 character set, but I want this to round trip at least: if we start with a ByteString, we should get back to it.
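A quick way to check that concern is to push a single upper-half byte through decodeLatin1 and then encodeUtf8. This is only a sketch; the names original, viaUtf8 and roundTripsViaUtf8 are made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE

-- A single byte from the upper half of Latin-1: 0xE9 is U+00E9 (e with acute).
original :: B.ByteString
original = B.pack [0xE9]

-- Decode as Latin-1, then encode as UTF-8.
-- U+00E9 becomes the two-byte UTF-8 sequence 0xC3 0xA9.
viaUtf8 :: B.ByteString
viaUtf8 = TE.encodeUtf8 (TE.decodeLatin1 original)

-- False: encodeUtf8 is not an inverse of decodeLatin1 for bytes >= 0x80.
roundTripsViaUtf8 :: Bool
roundTripsViaUtf8 = viaUtf8 == original
```

For bytes below 0x80 the two encodings agree, so the mismatch only shows up in the upper half.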

CodePudding user response:

DISCLAIMER: consider this a long comment rather than a solution, because I haven't tested it.

I think you can do it with the witch library. It is a general-purpose type conversion library with a fair amount of type safety. It has a type class called TryFrom for conversions between types that might fail.

Luckily, witch provides conversions to and from encodings too. It has an instance TryFrom Text (ISO_8859_1 ByteString), meaning that you can convert between Text and Latin-1 encoded ByteString. So I think (not tested!!) this should work:

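{-# LANGUAGE TypeApplications #-}

import Data.ByteString (ByteString)
import Data.Tagged (Tagged(unTagged))
import Data.Text (Text)
import Witch (tryInto, ISO_8859_1)

f :: Text -> Maybe ByteString
f s = case tryInto @(ISO_8859_1 ByteString) s of
  Left _   -> Nothing             -- some character is outside Latin-1
  Right bs -> Just (unTagged bs)  -- strip the encoding tag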

Notice that tryInto returns an Either (TryFromException source target) target, so if you want to handle errors you can keep the Either instead of collapsing it to Maybe. Up to you.

Also, the witch docs point out that this conversion goes via the String type, so there is probably an out-of-the-box solution that doesn't depend on the witch package. I don't know of one, and looking at the source code hasn't helped.

Edit:

Having read the witch source code, apparently this should work:

import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C
import Data.Char (isLatin1)
import Data.Text (Text)
import qualified Data.Text as T

f :: Text -> Maybe ByteString
f t = if allCharsAreLatin then Just (C.pack str) else Nothing
 where str = T.unpack t
       -- isLatin1 holds for code points 0 to 255
       allCharsAreLatin = all isLatin1 str

CodePudding user response:

The latin1 encoding is pretty damn simple -- codepoint X maps to byte X, whenever that's in range of a byte. So just unpack and repack immediately.

import Control.Monad
import qualified Data.Text as T
import qualified Data.ByteString.Char8 as BS

latin1EncodeText :: T.Text -> Maybe BS.ByteString
latin1EncodeText t = BS.pack (T.unpack t) <$ guard (T.all (<'\256') t)
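As a sanity check of the round-trip property from the question, the sketch below feeds every possible byte value through decodeLatin1 and back. The function latin1EncodeText is repeated so the block stands alone; allBytes, roundTrips and rejectsNonLatin1 are names made up for this check.

```haskell
import Control.Monad (guard)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1)

-- Same definition as above, repeated so this block is self-contained.
latin1EncodeText :: T.Text -> Maybe BS.ByteString
latin1EncodeText t = BS.pack (T.unpack t) <$ guard (T.all (< '\256') t)

-- Every possible byte value, 0x00 through 0xFF, in order.
allBytes :: B.ByteString
allBytes = B.pack [minBound .. maxBound]

-- The property from the question: encoding undoes decodeLatin1.
roundTrips :: Bool
roundTrips = latin1EncodeText (decodeLatin1 allBytes) == Just allBytes

-- A character outside Latin-1 (U+03BB, Greek lambda) is rejected.
rejectsNonLatin1 :: Bool
rejectsNonLatin1 = latin1EncodeText (T.pack "\955") == Nothing
```

Since decodeLatin1 works byte by byte, checking all 256 values in one ByteString covers every case the property can hit.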

It's possible to avoid the intermediate String, but you should probably make sure this is your bottleneck before trying for that.
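If it does turn out to matter, one way to skip the String might look like the sketch below. It is an assumption-laden illustration rather than a tuned implementation: latin1EncodeDirect is a made-up name, and it still traverses the Text more than once (T.all, T.length, and the unfold itself).

```haskell
import Data.Char (isLatin1, ord)
import qualified Data.ByteString as B
import qualified Data.Text as T

-- Build the ByteString directly from the Text with unfoldrN,
-- without materialising an intermediate String.
latin1EncodeDirect :: T.Text -> Maybe B.ByteString
latin1EncodeDirect t
  | T.all isLatin1 t = Just (fst (B.unfoldrN (T.length t) step t))
  | otherwise        = Nothing
  where
    -- Peel off one character and emit its code point as a byte.
    -- The guard above ensures every code point is at most 255.
    step s = do
      (c, rest) <- T.uncons s
      pure (fromIntegral (ord c), rest)
```

Whether this is actually faster than the unpack/pack version depends on the text version and on fusion, so it is worth benchmarking before switching.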
