How to detect whether a user-selected .txt file is Unicode/UTF-8 and convert it to ANSI

My non-Unicode Delphi 7 application allows users to open .txt files.

Sometimes users try to open UTF-8 or Unicode .txt files, which causes problems.

I need a function that detects whether the user is opening a .txt file with UTF-8 or Unicode encoding and, when possible, automatically converts it to the system's default code page (ANSI) so that the app can use it.

In cases when converting is not possible, the function should return an error.

The ReturnAsAnsiText(filename) function should open the .txt file, then detect and convert in steps like this:

If the byte stream has no byte values above 0x7F, it is ANSI; return it as is.
If the byte stream has byte values above 0x7F, convert from UTF-8.
If the stream has a BOM, try Unicode conversion.
If conversion to the system's current code page is not possible, return NULL to indicate an error.
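
For illustration only, the steps above could be sketched as follows. Python is used purely to show the logic; a Delphi 7 port would have to do the same work with Win32 calls. The names `return_as_ansi_text` and `_decode` and the default `cp1252` code page are assumptions made for this sketch, not part of any library. The BOM check is moved to the front, because a BOM itself contains bytes above 0x7F and would otherwise be caught by the UTF-8 rule.

```python
import codecs


def _decode(raw, encoding):
    """Hypothetical helper: decode strictly, returning None on invalid input."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def return_as_ansi_text(data, ansi_codepage="cp1252"):
    """Return `data` re-encoded in `ansi_codepage`, or None if that is not possible.

    `ansi_codepage` stands in for the system's current code page (an assumption).
    UTF-32 BOMs are not handled in this sketch.
    """
    if data.startswith(codecs.BOM_UTF8):
        text = _decode(data[len(codecs.BOM_UTF8):], "utf-8")
    elif data.startswith(codecs.BOM_UTF16_LE):
        text = _decode(data[2:], "utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):
        text = _decode(data[2:], "utf-16-be")
    elif all(b <= 0x7F for b in data):
        # No BOM and no high bytes: return the bytes unchanged.
        return data
    else:
        # No BOM but high bytes present: the stated rule is "convert from UTF-8".
        text = _decode(data, "utf-8")

    if text is None:
        return None
    try:
        return text.encode(ansi_codepage)
    except UnicodeEncodeError:
        # At least one character does not exist in the target code page.
        return None
```

Under these rules, a file that is already ANSI and contains accented characters (for example the cp1252 bytes `caf\xe9`) is rejected, because it has bytes above 0x7F but is not valid UTF-8.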

It is an acceptable limitation for this function that users can open only files that match their region/code page (the Control Panel regional setting for non-Unicode apps).

CodePudding user response：

The conversion function ReturnAsAnsiText, as you designed, will have a number of issues:

The Delphi 7 application may not be able to open files whose filenames use UTF-8 or UTF-16 characters.
UTF-8 (and other Unicode) usage has increased significantly since 2019. Current web pages are between 98% and 100% UTF-8, depending on the language.
Your design will incorrectly translate some text that a standards-compliant converter would handle.

Creating ReturnAsAnsiText is beyond the scope of an answer, but you should look for an existing library rather than writing a new function. I haven't used Delphi 7 (released in 2002, a few versions before Delphi 2005), but I found this MIT-licensed library that may get you there. It has a number of caveats:

It doesn't support all forms of BOM.
It doesn't support all encodings.
There is no universal "best-fit" behavior for single-byte character sets.

There are other issues that are tangentially described in this question. You wouldn't use an external command, but I used one here to demonstrate the point:

% iconv -f utf-8 -t ascii//TRANSLIT < hello.utf8
^h'elloe
iconv: (stdin):1:6: cannot convert
% iconv -f utf-8 -t ascii < hello.utf8
iconv: (stdin):1:0: cannot convert

Enabling TRANSLIT in standards-based libraries lets the converter map characters like é to ASCII e. It still fails on characters like π, because no ASCII character is similar in form.
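
The same distinction can be shown without an external command. The sketch below is only a rough analogue of transliteration, not iconv's algorithm: it decomposes characters with Unicode NFKD normalization and drops the combining marks. The function name `to_ascii_best_effort` is made up for this example.

```python
import unicodedata


def to_ascii_best_effort(text):
    """Strip accents via NFKD decomposition; return None if non-ASCII text remains."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks such as the acute accent left over from "é".
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    try:
        return stripped.encode("ascii")
    except UnicodeEncodeError:
        # Characters like "π" have no decomposition into ASCII letters.
        return None


def strict_ascii(text):
    """Strict conversion: any non-ASCII character is an error (None)."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None
```

With these helpers, `strict_ascii("héllo")` fails while `to_ascii_best_effort("héllo")` yields `b"hello"`; both fail on `"π"`.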

CodePudding user response：

The answer you are asking for would need massive UTF-8 and UTF-16 translation tables for every supported code page and the BMP, and it would still be unable to detect the source encoding reliably.

Notepad has trouble with this issue.

The solution as requested would probably entail more effort than you put into the original program.

Possible solutions

Add a text editor into your program. If you write it, you will be able to read it.

The following solution pushes the translation to established tables provided by Windows.

Use native Win32 API calls to translate strings, with functions like WideCharToMultiByte. Even this has its drawbacks (quoted from the referenced page; the note is more relevant to the topic, but the caution is important for security):

Caution Using the WideCharToMultiByte function incorrectly can compromise the security of your application. Calling this function can easily cause a buffer overrun because the size of the input buffer indicated by lpWideCharStr equals the number of characters in the Unicode string, while the size of the output buffer indicated by lpMultiByteStr equals the number of bytes. To avoid a buffer overrun, your application must specify a buffer size appropriate for the data type the buffer receives.

Data converted from UTF-16 to non-Unicode encodings is subject to data loss, because a code page might not be able to represent every character used in the specific Unicode data. For more information, see Security Considerations: International Features.

Note The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not.

This solution still has the guess-the-encoding problem, but if a BOM is present, this is one of the best translators available.
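
To make the two quoted warnings concrete, here is a small Python sketch (not Win32 code) that shows why the character count and the byte count differ, and how data loss can be detected with a round trip. The function name `ansi_size_and_loss` and the choice of `cp932` (Japanese) as the ANSI code page are assumptions for illustration. In the real API, calling WideCharToMultiByte with an output size of 0 returns the required buffer size, and the lpUsedDefaultChar parameter reports whether a substitute character was used.

```python
def ansi_size_and_loss(text, ansi_codepage="cp932"):
    """Return (UTF-16 code units, bytes after conversion, converted without loss?)."""
    # Number of 16-bit units, which is what the input length counts.
    utf16_units = len(text.encode("utf-16-le")) // 2
    # Characters missing from the code page are replaced with "?".
    converted = text.encode(ansi_codepage, errors="replace")
    # If decoding the result does not give the original text back, data was lost.
    lossless = converted.decode(ansi_codepage) == text
    return utf16_units, len(converted), lossless
```

For the three-character string "日本語", this reports 3 UTF-16 units but 6 output bytes, so an output buffer sized by character count would be too small.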

Simply require the text file to be saved in the local code page.

Other thoughts:

ANSI code pages, ASCII, and UTF-8 agree only on byte values 0 to 127. Above 127 they are all different (ASCII defines nothing there), and the control characters are handled differently.
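
A quick way to see this is to decode the same single byte under several encodings. The sketch below uses Python's built-in codecs; `decode_or_none` is a hypothetical helper, and cp1252 and cp1251 are just two example ANSI code pages.

```python
def decode_or_none(data, encoding):
    """Decode strictly; return None when the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


raw = b"\xe9"  # one byte above 0x7F

as_cp1252 = decode_or_none(raw, "cp1252")  # Western European: U+00E9 (é)
as_cp1251 = decode_or_none(raw, "cp1251")  # Cyrillic: U+0439 (й)
as_ascii = decode_or_none(raw, "ascii")    # None: ASCII stops at 0x7F
as_utf8 = decode_or_none(raw, "utf-8")     # None: a lone lead byte is invalid
```

The same byte is a different letter in each ANSI code page and is not valid text at all in ASCII or UTF-8.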

In UTF-16, every other byte of ASCII-range text is 0 (the zero byte comes first in big-endian, second in little-endian). This is not covered in your "rules".
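
A heuristic built on that observation might look like the sketch below. It is only a guess based on where zero bytes fall; the function name `guess_utf16_without_bom` and both thresholds are assumptions, and text that is mostly non-Latin will not match.

```python
def guess_utf16_without_bom(data, threshold=0.9):
    """Return "utf-16-le", "utf-16-be", or None, based on zero-byte positions."""
    if len(data) < 2:
        return None
    even = data[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = data[1::2]   # bytes at offsets 1, 3, 5, ...
    even_zero = even.count(0) / len(even)
    odd_zero = odd.count(0) / len(odd)
    # Little-endian: the low byte comes first, so zeros sit at odd offsets.
    if odd_zero >= threshold and even_zero < 0.1:
        return "utf-16-le"
    # Big-endian: the high (zero) byte comes first, at even offsets.
    if even_zero >= threshold and odd_zero < 0.1:
        return "utf-16-be"
    return None
```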

You simply have to search for the Turkish i to understand the complexities of Unicode translations and comparisons.
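
As a small taste of that problem, the Python sketch below shows that the Turkish dotless ı and dotted İ exist in the Turkish code page (cp1254) but not in the Western European one (cp1252), and that default case mapping does not follow Turkish rules. The helper `fits_codepage` is hypothetical.

```python
DOTLESS_I = "\u0131"         # ı, lowercase dotless i
DOTTED_CAPITAL_I = "\u0130"  # İ, uppercase dotted I


def fits_codepage(text, codepage):
    """True if every character of `text` can be encoded in `codepage`."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


in_turkish = fits_codepage(DOTLESS_I + DOTTED_CAPITAL_I, "cp1254")
in_western = fits_codepage(DOTLESS_I + DOTTED_CAPITAL_I, "cp1252")

# Locale-independent case mapping: "I" lowers to "i", not to the dotless "ı"
# that Turkish expects, and "İ" lowers to "i" plus a combining dot (2 code points).
lower_of_capital_i = "I".lower()
lower_of_dotted_capital = DOTTED_CAPITAL_I.lower()
```

So the same file can convert cleanly on a machine set to Turkish and fail on one set to Western European, and comparisons that ignore case can disagree with what a Turkish user expects.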

Leverage any expectations of the file contents to establish a coherent baseline comparison to make an educated guess.

For example, if it is a .csv file, find a comma in the various formats...
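
A minimal sketch of that idea, assuming the file is known to contain commas: look at how the first comma is laid out at 2-byte-aligned offsets. The function name `guess_by_comma` and the returned labels are made up for this example, and single-byte data that happens to contain zero bytes could fool it.

```python
def guess_by_comma(data):
    """Guess the encoding family from the byte layout of the first aligned comma."""
    # Walk the data in 2-byte steps so UTF-16 code units stay aligned.
    for i in range(0, len(data) - 1, 2):
        pair = data[i:i + 2]
        if pair == b",\x00":
            return "utf-16-le"
        if pair == b"\x00,":
            return "utf-16-be"
    if b"," in data:
        # A comma with no neighbouring zero byte: ASCII, ANSI, or UTF-8.
        return "8-bit"
    return None
```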

Bottom Line

There is no perfect general solution, only specific solutions tailored to your specific needs, which were extremely broad in the question.