I am trying with below code to convert from shift-jis file to utf-8, but when we open the output file it has corrupted characters, looks like something is missed out here, any thoughts?
// From file
FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen];
fread(lpszBuf, 1, nLen, shiftJisFile);
// convert multibyte to wide char
int utf16size = ::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, pUTF16, utf16size);
wstring str(pUTF16);
// convert wide char to multi byte utf-8 before writing to a file
fstream File("filepath", std::ios::out);
string result = string();
result.resize(WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, NULL, 0, 0, 0));
char* ptr = &result[0];
WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, ptr, result.size(), 0, 0);
File << result;
File.close();
CodePudding user response:
There are multiple problems.
The first problem is that when you are writing the output file, you need to set it to binary
for the same reason you need to do so when reading the input.
fstream File("filepath", std::ios::out | std::ios::binary);
The second problem is that when you are reading the input file, you are only reading the bytes of the input stream and treat them like a string. However, those bytes do not have a terminating null character. If you call MultiByteToWideChar
with a -1
length, it infers the input string length from the terminating null character, which is missing in your case. That means both utf16size
and the contents of pUTF16
are already wrong. Add it manually after reading the file:
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen 1];
fread(lpszBuf, 1, nLen, shiftJisFile);
lpszBuf[nLen] = 0;
The last problem is that you are using CP_ACP
. That means "the current code page". In your question, you were specifically asking how to convert Shift-JIS. The code page Windows uses for its closes equivalent to what is commonly called "Shift-JIS" is 932
(you can look that up on wikipedia for example). So use 932
instead of CP_ACP
:
int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
Additionally, there is no reason to create wstring str(pUTF16)
. Just use pUTF16
directly in the WideCharToMultiByte
calls.
Also, I'm not sure how kosher char *ptr = &result[0]
is. I personally would not create a string specifically as a buffer for this.
Here is the corrected code. I would personally not write it this way, but I don't want to impose my coding ideology on you, so I made only the changes necessary to fix it:
// From file
FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen 1];
fread(lpszBuf, 1, nLen, shiftJisFile);
lpszBuf[nLen] = 0;
// convert multibyte to wide char
int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
// convert wide char to multi byte utf-8 before writing to a file
fstream File("filepath", std::ios::out | std::ios::binary);
string result;
result.resize(WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, NULL, 0, 0, 0));
char *ptr = &result[0];
WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, ptr, result.size(), 0, 0);
File << ptr;
File.close();
Also, you have a memory leak -- lpszBuf
and pUTF16
are not cleaned up.
CodePudding user response:
You should try use std::locale
to perform this conversion:
namespace fs = std::filesystem;
void convert(const fs::path inName, const fs::path outName)
{
std::wifstream in{inName};
in.imbue(std::locale{".932"}); // or "ja_JP.SJIS"
if (in) {
std::wofstream out{outName};
out.imbue(std::locale{".utf-8"});
std::wstring line;
while (getline(in, line)) {
out << line << L'\n';
}
}
}
Note locale names are platform specific - I think I used proper one for Windows.
Update: I've tested this on my Window 10 machine with MSVC 19.29.30145 and works perfectly. I used
Note I used similar method here for Korean and it worked nicely.
CodePudding user response:
wstring str(pUTF16);
- pUTF16
there does not end with zero char. It should be wstring str(pUTF16, utf16size);