I am having some trouble with encoding special characters and parsing XML in a big complex project in Matlab. I have isolated the problem and though I still don't know how to solve it, I think it is related to this question.
Consider the following XML file:
<?xml version="1.0"?>
<info>
<desc>
<channels>
<channel>
<label>accZ</label>
<unit>mg₀</unit>
<type>ACC</type>
</channel>
</channels>
</desc>
</info>
Assuming Unix-style line endings, this file contains 175 bytes of data, and that is indeed what Notepad++ reports when I open it. I have an XML parsing function that I copied almost exactly from MathWorks' explanation of how to parse XML in MATLAB (https://au.mathworks.com/help/matlab/ref/xmlread.html). This function works very well and is not the issue; it is included only for completeness' sake:
% parse a simplified (attribute-free) subset of XML into a MATLAB struct
function result = parse_xml_struct(str)
    import org.xml.sax.InputSource
    import javax.xml.parsers.*
    import java.io.*
    % wrap the string in a character stream so xmlread can parse it
    tmp = InputSource();
    tmp.setCharacterStream(StringReader(str));
    result = parseChildNodes(xmlread(tmp));

    % this is part of xml2struct (slightly simplified)
    function [children,ptext] = parseChildNodes(theNode)
        % Recurse over node children.
        children = struct;
        ptext = [];
        if theNode.hasChildNodes
            childNodes = theNode.getChildNodes;
            numChildNodes = childNodes.getLength;
            for count = 1:numChildNodes
                theChild = childNodes.item(count-1);
                [text,name,childs] = getNodeData(theChild);
                if (~strcmp(name,'#text') && ~strcmp(name,'#comment'))
                    if (isfield(children,name))
                        % repeated element name: collect the values in a cell array
                        if (~iscell(children.(name)))
                            children.(name) = {children.(name)};
                        end
                        index = length(children.(name)) + 1;
                        children.(name){index} = childs;
                        if (~isempty(text))
                            children.(name){index} = text;
                        end
                    else
                        children.(name) = childs;
                        if (~isempty(text))
                            children.(name) = text;
                        end
                    end
                elseif (strcmp(name,'#text'))
                    % keep text nodes that are not pure whitespace
                    if (~isempty(regexprep(text,'[\s]*','')))
                        if (isempty(ptext))
                            ptext = text;
                        else
                            ptext = [ptext text];
                        end
                    end
                end
            end
        end
    end

    % this is part of xml2struct (slightly simplified)
    function [text,name,childs] = getNodeData(theNode)
        % Create structure of node info.
        name = char(theNode.getNodeName);
        if ~isvarname(name)
            % make the node name usable as a struct field name
            name = regexprep(name,'[-]','_dash_');
            name = regexprep(name,'[:]','_colon_');
            name = regexprep(name,'[.]','_dot_');
        end
        [childs,text] = parseChildNodes(theNode);
        if (isempty(fieldnames(childs)))
            try
                text = char(theNode.getData);
            catch
                % node carries no character data; keep text as is
            end
        end
    end
end
Now to test it:
finfo = dir('xml_example');
sz = finfo.bytes
fid = fopen('xml_example', 'r', 'ieee-le.l64');
data = fread(fid, sz, '*char');
fclose(fid);
data_size = size(data)
h = parse_xml_struct(data);
unit = h.info.desc.channels.channel.unit
and the output:
sz =
175
data_size =
173 1
unit =
'mg₀'
So somehow I end up with the right output, but lost 2 bytes along the way. I do not understand why this is happening.
And just to prove to myself that it is the little subscript zero that is causing the discrepancy between the size of the file and the number of bytes in my data array, I delete it from the XML file and get the following:
sz =
172
data_size =
172 1
unit =
'mg'
Still the right output for the xml label and now the file size and the byte array size match. What's the deal?
Update
Furthermore, if I run the same test on a symbol that is 2 bytes long, I still see the same shrinking: the data array is shorter than the file.
<?xml version="1.0"?>
<info>
<desc>
<channels>
<channel>
<label>ωX</label>
<unit>mrad/s</unit>
<type>AUX</type>
</channel>
</channels>
</desc>
</info>
Output:
sz =
175
data_size =
174 1
unit =
'ωX'
Answer
The ₀ symbol is three bytes long with a UTF-8 encoding (0xE2 0x82 0x80). And internal to MATLAB, it's actually two bytes due to a UTF-16 encoding (0x80 0x20 little-endian).
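These byte counts can be checked from MATLAB itself. The following is a minimal sketch, assuming `unicode2native` is available; it builds the character from its code point (U+2080) so the result does not depend on how the script file is saved, and the variable names are illustrative only.

```matlab
% Build the subscript zero from its code point (U+2080 = 8320 decimal).
sub0 = char(8320);

% Encode the single character into raw bytes under two encodings.
utf8_bytes  = unicode2native(sub0, 'UTF-8')     % 226 130 128  (0xE2 0x82 0x80)
utf16_bytes = unicode2native(sub0, 'UTF-16LE')  % 128 32       (0x80 0x20)

% As a char, it is still exactly one element.
n_chars = numel(sub0)                           % 1
```

The three UTF-8 bytes collapsing into one char element accounts for the two "missing" bytes in the question (175 versus 173).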
However, since a precision of *char was given to fread, the data is returned as a char array [1]. And to a char array, regardless of the underlying encoding, ₀ is simply a single character when considering its size:
"If A is a character vector of type char, then size returns the row vector [1 M] where M is the number of characters."
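The same reasoning applies to the two-byte ω from the question's update. As a small sketch (again using a code point, U+03C9, rather than a literal, with illustrative variable names), `size` counts characters while the UTF-8 encoding needs one more byte:

```matlab
% 'ωX' built from code points: U+03C9 (969 decimal) followed by 'X'.
label = [char(969) 'X'];

% size reports the number of characters, not bytes.
label_size = size(label)                               % 1 2

% The same text encoded as UTF-8 takes three bytes (2 for ω, 1 for X).
label_bytes = numel(unicode2native(label, 'UTF-8'))    % 3
```

That one-byte difference matches the 175 versus 174 reported in the update.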
[1] I'm assuming the strong assertion of matching matrix size from fread if sizeA is given (per the documentation) is only valid for numeric data since, as may be evident above, byte count and character count are not necessarily one-to-one.
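If a count that matches the size reported by dir is needed, one option is to read the file as raw bytes and decode afterwards. The sketch below assumes the file is UTF-8 encoded; it writes a hypothetical temporary file containing only the unit line so that it runs on its own, and every file and variable name in it is made up for illustration.

```matlab
% Write a small UTF-8 test file (hypothetical temporary file).
fname = [tempname '.xml'];
txt = ['<unit>mg' char(8320) '</unit>'];     % contains U+2080
fid = fopen(fname, 'w');
fwrite(fid, unicode2native(txt, 'UTF-8'), 'uint8');
fclose(fid);

% Read as raw bytes: the count matches the file size on disk.
fid = fopen(fname, 'r');
raw = fread(fid, Inf, '*uint8');
fclose(fid);

% Read as characters, stating the encoding explicitly: multi-byte
% sequences are decoded, so the count is in characters.
fid = fopen(fname, 'r', 'n', 'UTF-8');
chars = fread(fid, Inf, '*char');
fclose(fid);

n_bytes = numel(raw)       % 18
n_chars = numel(chars)     % 16

% Decoding the raw bytes by hand recovers the original text.
txt_back = native2unicode(raw.', 'UTF-8');
same_text = isequal(txt_back, txt)

delete(fname);
```

Reading with '*uint8' keeps the byte count intact, and the decoded text can still be passed to a parser such as parse_xml_struct from the question.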