Is there a function f :: Text -> Maybe ByteString such that, for all x, f (decodeLatin1 x) == Just x?
Note that decodeLatin1 has the signature:
decodeLatin1 :: ByteString -> Text
I'm concerned that encodeUtf8 is not what I want: I'm guessing it just dumps the text out as a UTF-8 encoded ByteString, rather than reversing the changes that decodeLatin1 made on the way in to characters in the upper half of the character set.
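That guess can be checked with a small sketch, assuming the text and bytestring packages. The names original, viaUtf8 and roundTrips are made up for illustration. A byte in the upper half, such as 0xE9, decodes to the code point U+00E9, which UTF-8 then writes as two bytes, so the round trip does not give back the original ByteString:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE

-- A single byte from the upper half of Latin-1 (0xE9 is 'é').
original :: B.ByteString
original = B.pack [0xE9]

-- decodeLatin1 maps byte 0xE9 to code point U+00E9;
-- encodeUtf8 then encodes that code point as the two bytes 0xC3 0xA9.
viaUtf8 :: B.ByteString
viaUtf8 = TE.encodeUtf8 (TE.decodeLatin1 original)

-- Compares the re-encoded bytes with the input; this is False here.
roundTrips :: Bool
roundTrips = viaUtf8 == original
```

For bytes below 0x80 the two encodings agree, so the mismatch only shows up for the upper half.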
I understand that f has to return a Maybe, because in general there are Unicode characters that aren't in the Latin-1 character set, but I just want this to round-trip at least: if we start with a ByteString, we should get back to it.
CodePudding user response:
DISCLAIMER: consider this a long comment rather than a solution, because I haven't tested.
I think you can do it with the witch library. It is a general-purpose type conversion library with a fair amount of type safety. It has a type class called TryFrom for conversions between types that might fail.
Luckily, witch provides conversions to and from encodings too, with an instance TryFrom Text (ISO_8859_1 ByteString), meaning that you can convert between Text and Latin-1 encoded ByteString. So I think (not tested!) this should work:
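{-# LANGUAGE TypeApplications #-}
import Data.ByteString (ByteString)
import Data.Tagged (Tagged(unTagged))
import Data.Text (Text)
import Witch (tryInto, ISO_8859_1)
f :: Text -> Maybe ByteString
f s = case tryInto @(ISO_8859_1 ByteString) s of
  -- Some character has no Latin-1 representation.
  Left _ -> Nothing
  -- Strip the encoding tag to get the raw bytes.
  Right bs -> Just (unTagged bs)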
Notice that tryInto returns an Either whose Left carries a TryFromException, so if you want to handle errors you can work with the Either directly. Up to you.
Also, the witch docs point out that this conversion is done via the String type, so there is probably an out-of-the-box solution that doesn't depend on the witch package. I don't know of such a solution, and looking at the source code hasn't helped.
Edit:
Having read the witch source code, apparently this should work:
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C
import Data.Char (isLatin1)
import Data.Text (Text)
import qualified Data.Text as T
f :: Text -> Maybe ByteString
f t = if allCharsAreLatin then Just (C.pack str) else Nothing
  where
    str = T.unpack t
    -- isLatin1 holds for code points up to U+00FF.
    allCharsAreLatin = all isLatin1 str
CodePudding user response:
The latin1 encoding is pretty damn simple -- codepoint X maps to byte X, whenever that's in range of a byte. So just unpack and repack immediately.
import Control.Monad
import qualified Data.Text as T
import qualified Data.ByteString.Char8 as BS
latin1EncodeText :: T.Text -> Maybe BS.ByteString
latin1EncodeText t = BS.pack (T.unpack t) <$ guard (T.all (<'\256') t)
It's possible to avoid the intermediate String, but you should probably make sure this is your bottleneck before trying for that.
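For what it's worth, one way to skip the intermediate String could look like the sketch below. It is untested against real workloads and has not been benchmarked; the name latin1EncodeDirect is made up for illustration. It uses unfoldrN from Data.ByteString to write one byte per character while walking the Text with T.uncons:

```haskell
import qualified Data.ByteString as B
import Data.Char (isLatin1, ord)
import qualified Data.Text as T

-- Hypothetical helper: encode Text as Latin-1 without building a String.
latin1EncodeDirect :: T.Text -> Maybe B.ByteString
latin1EncodeDirect t
  -- Every character must fit in one byte (code point <= 0xFF).
  | T.all isLatin1 t = Just (fst (B.unfoldrN (T.length t) step t))
  | otherwise = Nothing
  where
    -- Emit the next character's code point as a byte, if any remain.
    step s = do
      (c, rest) <- T.uncons s
      pure (fromIntegral (ord c), rest)
```

This still traverses the Text more than once (the check, the length, and the unfold), so whether it actually beats the unpack/pack version is something to measure rather than assume.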