YAML: encoding vs semantic differences-CodePudding

I would like to get some better understanding about what aspects of YAML refer to the encoding of data vs what aspects refer to semantic.

A simple example:

test1: dGVzdDE=
test2: !!binary |
  dGVzdDE=
test3: 
- 116
- 101
- 115
- 116
- 49
test4: test1

Which of these values (if any) are equivalent?

I would argue that test1 encodes the literal string value dGVzdDE=. test2 and test3 both encode the same array, just using a different encoding. I am unsure about test4, it contains the same bytes as test2 and test3 but does this make it the equivalent value or is a string in YAML different from a byte array?

Different tools seem to produce different answers:

https://onlineyamltools.com/convert-yaml-to-json suggests that test2 and test3 are equivalent, but different from test4
https://yaml-online-parser.appspot.com/ suggests that test2 and test4 are equivalent, but different from test4
to yq all entries are different yq < test.yml:

{
  "test1": "dGVzdDE=",
  "test2": "dGVzdDE=\n",
  "test3": [
    116,
    101,
    115,
    116,
    49
  ],
  "test4": "test1"
}

What does the YAML spec intend?

CodePudding user response：

Equality

You're asking for equivalence but that's not a term in the spec and therefore cannot be discussed (at least not without definition). I'll go with discussing equality instead, which is defined by the spec as follows:

Two scalars are equal only when their tags and canonical forms are equal character-by-character. Equality of collections is defined recursively.

One node in your example has the tag !!binary but the others do not have tags. So we must check what the spec says about tags of nodes that don't have explicit tags:

Tags and Schemes

The YAML spec says that every node is to have a tag. Any node that does not have an explicit tag gets a non-specific tag assigned. Nodes are divided into scalars (that get created from textual content) and collections (sequences and mappings). Every non-plain scalar node (i.e. every scalar in quotes or given via | or >) that does not have an explicit tag gets the non-specific tag !, every other node without explicit tag gets the non-specific tag ?.

During loading, the spec defines that non-specific tags are to be resolved to specific tags by means of using a scheme. The specification describes some schemes, but does not require an implementation to support any particular one.

The failsafe scheme, which is designed to be the most basic scheme, will resolve non-specific tags as follows:

on scalars to !!str
on sequences to !!seq
on mappings to !!map

and that's it.

A scheme is allowed to derive a specific tag from a non-specific one by considering the kind of non-specific tag, the node's position in the document, and the node's content. For example, the JSON Scheme will give a scalar true the tag !!bool due to its content.

The spec says that the non-specific tag ! should only be resolved to !!str for scalars, !!seq for sequence, and !!map for mappings, but does not require this. This is what most implementations support and means that if you quote your scalar, you will get a string. This is important so that you can give the scalar "true" quoted to avoid getting a boolean value.

By the way, the spec does not say that every step defined there is to be implemented slavishly as defined in the spec, it is more a logical description. A lot of implementations do not actually transition from non-specific tags to specific tags, but instead directly choose native types for the YAML data they load according to the scheme rules.

Applying Equality

Now that we know how tags are assigned to nodes, let's go over your example:

test1: dGVzdDE=
test2: !!binary |
  dGVzdDE=

The two values are immediately not equal because even without the tag, their content differs: Literal block scalars (introduced with |) contain the final linebreak, so the value of test2 is "dGVzdEDE=\n" and therefore not equal to the test1 value. You can introduce the literal scalar with |- instead to chop the final linebreak, which I suppose is your intent. In that case, the scalar content is identical.

Now for the tag: The value of test1 is a plain scalar, hence it has a non-specific tag ?. The question is now: Will this be resolved to !!binary? There could be a scheme that does this, but the spec doesn't define one. But think about it: A scheme that assigns every scalar the tag !!binary if it looks like base64-encoded data would be a very specific one.

As for the other values: The test3 value is a sequence, so obviously not equal to any other value. The test4 value contains content not present anywhere else, therefore also not equal.

But yaml-online-parser does things!

Yes. The YAML spec explicitly states that the target of loading YAML data is native data types. Tags are thought of as generic hints that can be mapped to native data types by a specific implementation. So an !!str for example would be resolved to the target language's string type.

How this mapping to native types is done is implementation-defined (and must be, since the spec cannot cater to every language out there). yaml-online-parser uses PyYAML and what it does is to load the YAML into Python's native data types, and then dump it again. In this process, the !!binary will get loaded into a Python binary string. However, during dumping, this binary string will get interpreted as UTF-8 string and then written as plain scalar. You can argue this is a bug, but it certainly doesn't violate the spec (as the spec doesn't know what a Python binary string is and therefore does not define how it is to be represented).

In any case, this shows that as soon as you transition to native types and back again, everything goes and nothing is certain because native types are outside of the spec. Different implementations will give you different outputs because they are allowed to. !!binary is not a tag defined in the JSON scheme so even translating your input to JSON is not well-defined.

If you want an online tool that shows you canonical YAML representation without loading data into native types and back, you can use the NimYAML testing ground (my work).

Conclusion

The question of whether two YAML inputs are equal is an academic one. Since YAML does allow for different schemes, the question can only be definitely answered in the context of a certain scheme.

However, you will find very few formal scheme definitions outside of the YAML spec. Most applications that do use YAML will document their input structure in a less formal way, and most of the time without discussing YAML tags. This is fine because as discussed before, loading YAML does not need to directly implement the logical process described in the spec.

Your answer for practical purposes should come from the documentation of the application consuming the YAML data. If the documentation is very good, it will answer this, but a lot of YAML-consuming applications just use the default settings of the YAML implementation they use without telling you about this.

So the takeaway is: Know your application and know the YAML implementation it uses.