Why is my byte array returned from fread smaller than the number of bytes in the file itself when it

I am having some trouble with encoding special characters and parsing XML in a big, complex project in MATLAB. I have isolated the problem and, though I still don't know how to solve it, I think it is related to this question.

Consider the following XML file:

<?xml version="1.0"?>
<info>
    <desc>
        <channels>
            <channel>
                <label>accZ</label>
                <unit>mg₀</unit>
                <type>ACC</type>
            </channel>
        </channels>
    </desc>
</info>

Assuming Unix-style line endings, this file contains 175 bytes of data. Indeed, that is what it says when I open it in Notepad++. Now I have an XML parsing function that I copied almost exactly from MathWorks' explanation of how to parse XML in MATLAB (https://au.mathworks.com/help/matlab/ref/xmlread.html). This function works very well and is not the issue; it is included only for completeness' sake:

% parse a simplified (attribute-free) subset of XML into a MATLAB struct
function result = parse_xml_struct(str)
import org.xml.sax.InputSource
import javax.xml.parsers.*
import java.io.*
tmp = InputSource();
tmp.setCharacterStream(StringReader(str));
result = parseChildNodes(xmlread(tmp));

% this is part of xml2struct (slightly simplified)
    function [children,ptext] = parseChildNodes(theNode)
        % Recurse over node children.
        children = struct;
        ptext = [];
        if theNode.hasChildNodes
            childNodes = theNode.getChildNodes;
            numChildNodes = childNodes.getLength;
            for count = 1:numChildNodes
                theChild = childNodes.item(count-1);
                [text,name,childs] = getNodeData(theChild);
                if (~strcmp(name,'#text') && ~strcmp(name,'#comment'))
                    if (isfield(children,name))
                        if (~iscell(children.(name)))
                            children.(name) = {children.(name)}; end
                        index = length(children.(name)) + 1;
                        children.(name){index} = childs;
                        if(~isempty(text))
                            children.(name){index} = text; end
                    else
                        children.(name) = childs;
                        if(~isempty(text))
                            children.(name) = text; end
                    end
                elseif (strcmp(name,'#text'))
                    if (~isempty(regexprep(text,'[\s]*','')))
                        if (isempty(ptext))
                            ptext = text;
                        else
                            ptext = [ptext text];
                        end
                    end
                end
            end
        end
    end

% this is part of xml2struct (slightly simplified)
    function [text,name,childs] = getNodeData(theNode)
        % Create structure of node info.
        name = char(theNode.getNodeName);
        if ~isvarname(name)
            name = regexprep(name,'[-]','_dash_');
            name = regexprep(name,'[:]','_colon_');
            name = regexprep(name,'[.]','_dot_');
        end
        [childs,text] = parseChildNodes(theNode);
        if (isempty(fieldnames(childs)))
            try
                text = char(theNode.getData);
            catch
            end
        end
    end
end

Now to test it:

finfo = dir('xml_example');
sz = finfo.bytes
fid = fopen('xml_example', 'r', 'ieee-le.l64');
data = fread(fid, sz, '*char');
fclose(fid);
data_size = size(data)
h = parse_xml_struct(data);
unit = h.info.desc.channels.channel.unit

and the output:

sz =

   175


data_size =

   173     1


unit =

    'mg₀'

So somehow I end up with the right output, but I have lost 2 elements along the way: the file is 175 bytes, yet the array has only 173 entries. I do not understand why this is happening.

And just to prove to myself that it is the little subscript zero ('₀') that is causing the discrepancy between the size of the file and the number of elements in my data array, I deleted it from the XML file and got the following:

sz =

   172


data_size =

   172     1


unit =

    'mg'

Still the right output for the xml label and now the file size and the byte array size match. What's the deal?

Update

Furthermore, if I run the same test on a symbol that is 2 bytes long (ω in the label field), I still see the same shrinkage, this time by one element:

<?xml version="1.0"?>
<info>
    <desc>
        <channels>
            <channel>
                <label>ωX</label>
                <unit>mrad/s</unit>
                <type>AUX</type>
            </channel>
        </channels>
    </desc>
</info>

Output:

sz =

   175


data_size =

   174     1


unit =

    'ωX'

Answer:

The ₀ symbol (U+2080) is three bytes long in UTF-8 (0xE2 0x82 0x80). Internally, MATLAB stores it as a single two-byte UTF-16 code unit (0x80 0x20 in little-endian byte order).
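As a quick check of those byte counts, here is a minimal sketch using unicode2native, which converts a char array to bytes in a named encoding. The variable names are only illustrative, and ω (U+03C9) is included because it is the two-byte symbol from the question's update.

```matlab
% Build the two symbols from their code points so the script does not
% depend on how this source file itself is encoded.
sub0  = char(hex2dec('2080'));   % U+2080, SUBSCRIPT ZERO
omega = char(hex2dec('03C9'));   % U+03C9, GREEK SMALL LETTER OMEGA

% Encode each character as UTF-8; the result is a uint8 row vector.
utf8_sub0  = unicode2native(sub0,  'UTF-8');
utf8_omega = unicode2native(omega, 'UTF-8');

% Compare the number of bytes in the file encoding with the number of
% elements the same symbol occupies in a char array.
fprintf('U+2080: %d UTF-8 bytes, %d char element(s)\n', ...
        numel(utf8_sub0), numel(sub0));
fprintf('U+03C9: %d UTF-8 bytes, %d char element(s)\n', ...
        numel(utf8_omega), numel(omega));

% Show the individual byte values in hex.
disp(dec2hex(double(utf8_sub0)));
```

Each symbol is one element of a char array no matter how many bytes it takes in the file.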

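However, since a precision of *char was given to fread, the bytes are decoded and returned as a char array¹, with each multi-byte sequence becoming a single element. That accounts for the numbers in the question: the three bytes of ₀ become one character (175 - 2 = 173), and the two bytes of ω become one (175 - 1 = 174). To a char array, regardless of the underlying encoding, ₀ is simply a single character as far as its size is concerned: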

If A is a character vector of type char, then size returns the row vector [1 M] where M is the number of characters.

¹ I'm assuming that the documentation's strong assertion that the output of fread matches sizeA when it is given only holds for numeric data since, as is evident above, byte count and character count are not necessarily one-to-one.
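If the goal is an array whose length matches the file size, one option is to read with a numeric precision such as '*uint8' and decode separately. The sketch below is an illustration rather than a drop-in fix: it writes its own small temporary file (a hypothetical stand-in for xml_example) and assumes the data is UTF-8, which is stated explicitly through the fourth argument of fopen.

```matlab
% Write a small UTF-8 test file: 'mg' followed by U+2080 (4 bytes on disk).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fwrite(fid, unicode2native(['mg' char(hex2dec('2080'))], 'UTF-8'), 'uint8');
fclose(fid);

finfo = dir(fname);                       % finfo.bytes is the on-disk size

% 1) Read as raw bytes: one array element per byte in the file.
fid = fopen(fname, 'r');
raw = fread(fid, Inf, '*uint8');
fclose(fid);

% 2) Read as characters, naming the encoding explicitly in fopen.
fid = fopen(fname, 'r', 'n', 'UTF-8');
txt = fread(fid, Inf, '*char');
fclose(fid);

% 3) Decoding the raw bytes by hand yields the same text as step 2.
decoded = native2unicode(raw', 'UTF-8');

fprintf('bytes on disk: %d, uint8 elements: %d, chars: %d\n', ...
        finfo.bytes, numel(raw), numel(txt));

delete(fname);                            % clean up the temporary file
```

Under these assumptions the uint8 read should match finfo.bytes, while the char read is shorter by two elements, mirroring the 175 versus 173 seen in the question.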