Why are the parsed dicts equal while the pickled dicts are not?

I'm working on an aggregated config file parsing tool that I hope can support .json, .yaml and .toml files. To that end, I ran the following tests:

The example.json config file is:

{
  "DEFAULT":
  {
    "ServerAliveInterval": 45,
    "Compression": true,
    "CompressionLevel": 9,
    "ForwardX11": true
  },
  "bitbucket.org":
    {
      "User": "hg"
    },
  "topsecret.server.com":
    {
      "Port": 50022,
      "ForwardX11": false
    },
  "special":
    {
      "path":"C:\\Users",
      "escaped1":"\n\t",
      "escaped2":"\\n\\t"
    }  
}

The example.yaml config file is:

DEFAULT:
  ServerAliveInterval: 45
  Compression: yes
  CompressionLevel: 9
  ForwardX11: yes
bitbucket.org:
  User: hg
topsecret.server.com:
  Port: 50022
  ForwardX11: no
special:
  path: C:\Users
  escaped1: "\n\t"
  escaped2: \n\t

and the example.toml config file is:

[DEFAULT]
ServerAliveInterval = 45
Compression = true
CompressionLevel = 9
ForwardX11 = true
['bitbucket.org']
User = 'hg'
['topsecret.server.com']
Port = 50022
ForwardX11 = false
[special]
path = 'C:\Users'
escaped1 = "\n\t"
escaped2 = '\n\t'

Then, the test code, with its output in comments, is:

import pickle, json, yaml
# TOML, see https://github.com/hukkin/tomli
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib

path = "example.json"
with open(path) as file:
    config1 = json.load(file)
    assert isinstance(config1,dict)
    pickled1 = pickle.dumps(config1)

path = "example.yaml"
with open(path, 'r', encoding='utf-8') as file:
    config2 = yaml.safe_load(file)
    assert isinstance(config2,dict)
    pickled2 = pickle.dumps(config2)

path = "example.toml"
with open(path, 'rb') as file:
    config3 = tomllib.load(file)
    assert isinstance(config3,dict)
    pickled3 = pickle.dumps(config3)

print(config1==config2) # True
print(config2==config3) # True
print(pickled1==pickled2) # False
print(pickled2==pickled3) # True

So my question is: since the parsed objects are all dicts, and these dicts are equal to each other, why are their pickled bytes not the same? That is, why does the dict parsed from JSON pickle differently from the other two?

Thanks in advance.

CodePudding user response：

The difference is due to two things:

1. The json module memoizes object keys that have the same value (it's not interning them, but the scanner object contains a memo dict that it uses to dedupe identical key strings within a single parsing run), while yaml does not (it just makes a new str each time it sees the same data), and
2. pickle faithfully reproduces the exact structure of the data it's told to dump, replacing subsequent references to the same object with a back-reference to the first time it was seen (among other reasons, this makes it possible to dump recursive data structures, e.g. lst = [], lst.append(lst), without infinite recursion, and reproduce them faithfully when unpickled).

Issue #1 isn't visible in equality testing (strs compare equal with the same data, not just the same exact object in memory). But when pickle sees "ForwardX11" the first time, it inserts the pickled form of the object and emits a pickle opcode that assigns a number to that object. If that exact object is seen again (same memory address, not merely same value), instead of reserializing it, it just emits a simpler opcode that just says "Go find the object associated with the number from last time and put it here as well". If it's a different object though, even one with the same value, it's new, and gets serialized separately (and assigned another number in case the new object is seen again).
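
A minimal sketch of that memo behaviour, independent of any parser. The variable names are illustrative, and the strings are built with str.join only as one way to get two equal but separate str objects (identical literals may be merged into one object by the compiler):

```python
import pickle

# Build two equal strings that are separate objects in memory.
a = "".join(["Forward", "X11"])
b = "".join(["Forward", "X11"])
assert a == b and a is not b

shared = [a, a]      # the same object twice
distinct = [a, b]    # two equal but separate objects

# Equality only looks at values, so the lists compare equal...
print(shared == distinct)                              # True
# ...but pickle records object identity, so the byte streams differ.
print(pickle.dumps(shared) == pickle.dumps(distinct))  # False

# The same back-reference mechanism lets a self-referential list round-trip.
lst = []
lst.append(lst)
restored = pickle.loads(pickle.dumps(lst))
print(restored[0] is restored)                         # True
```

The pickle of shared is also shorter, since the second element is written as a back-reference rather than a second copy of the string.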

Simplifying your code to demonstrate the issue, you can inspect the generated pickle output to see how this is happening:

s = r'''{
  "DEFAULT":
  {
    "ForwardX11": true
  },
  "FOO":
    {
      "ForwardX11": false
    }
}'''

s2 = r'''DEFAULT:
  ForwardX11: yes
FOO:
  ForwardX11: no
'''

import io, json, yaml, pickle, pickletools

d1 = json.load(io.StringIO(s))
d2 = yaml.safe_load(io.StringIO(s2))
pickletools.dis(pickle.dumps(d1))
pickletools.dis(pickle.dumps(d2))


The output from that code for the json-parsed input (with # comments inline to point out important things), at least on Python 3.7 (the default pickle protocol and exact pickling format can change from release to release), is:

    0: \x80 PROTO      3
    2: }    EMPTY_DICT
    3: q    BINPUT     0
    5: (    MARK
    6: X        BINUNICODE 'DEFAULT'
   18: q        BINPUT     1
   20: }        EMPTY_DICT
   21: q        BINPUT     2
   23: X        BINUNICODE 'ForwardX11'      # Serializes 'ForwardX11'
   38: q        BINPUT     3                 # Assigns the serialized form the ID of 3
   40: \x88     NEWTRUE
   41: s        SETITEM
   42: X        BINUNICODE 'FOO'
   50: q        BINPUT     4
   52: }        EMPTY_DICT
   53: q        BINPUT     5
   55: h        BINGET     3                 # Looks up whatever object was assigned the ID of 3
   57: \x89     NEWFALSE
   58: s        SETITEM
   59: u        SETITEMS   (MARK at 5)
   60: .    STOP
highest protocol among opcodes = 2

while the output from the yaml loaded data is:

    0: \x80 PROTO      3
    2: }    EMPTY_DICT
    3: q    BINPUT     0
    5: (    MARK
    6: X        BINUNICODE 'DEFAULT'
   18: q        BINPUT     1
   20: }        EMPTY_DICT
   21: q        BINPUT     2
   23: X        BINUNICODE 'ForwardX11'   # Serializes as before
   38: q        BINPUT     3              # and assigns code 3 as before
   40: \x88     NEWTRUE
   41: s        SETITEM
   42: X        BINUNICODE 'FOO'
   50: q        BINPUT     4
   52: }        EMPTY_DICT
   53: q        BINPUT     5
   55: X        BINUNICODE 'ForwardX11'   # Doesn't see this 'ForwardX11' as being the exact same object, so reserializes
   70: q        BINPUT     6              # and marks again, in case this copy is seen again
   72: \x89     NEWFALSE
   73: s        SETITEM
   74: u        SETITEMS   (MARK at 5)
   75: .    STOP
highest protocol among opcodes = 2
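
If reading the disassembly by eye is tedious, the same difference can be counted programmatically. The following is a rough sketch, not part of the original test: count_memo_fetches is a hypothetical helper, and the separate dict is a stand-in for a parser that does not dedupe keys (each sub-dict comes from its own json.loads call, so it runs without PyYAML installed):

```python
import json
import pickle
import pickletools

text = '{"DEFAULT": {"ForwardX11": true}, "FOO": {"ForwardX11": false}}'

# One parsing run: json's memo hands back the same key object both times.
shared = json.loads(text)

# Stand-in for a non-deduping parser: every sub-dict is parsed in its own
# run, so the two 'ForwardX11' keys are equal but separate str objects.
separate = {
    "DEFAULT": json.loads('{"ForwardX11": true}'),
    "FOO": json.loads('{"ForwardX11": false}'),
}

def count_memo_fetches(obj):
    """Count the opcodes that fetch an already-pickled object from the memo."""
    fetch_ops = {"GET", "BINGET", "LONG_BINGET"}
    return sum(
        1
        for opcode, _arg, _pos in pickletools.genops(pickle.dumps(obj))
        if opcode.name in fetch_ops
    )

print(shared == separate)            # True: same keys and values
print(count_memo_fetches(shared))    # 1: the second 'ForwardX11' is a back-reference
print(count_memo_fetches(separate))  # 0: every string is serialized in full
```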

Printing the id of each such string gives you similar information; e.g., replacing the pickletools lines with:

for k in d1['DEFAULT']:
    print(id(k))
for k in d1['FOO']:
    print(id(k))

for k in d2['DEFAULT']:
    print(id(k))
for k in d2['FOO']:
    print(id(k))

will show a consistent id for both 'ForwardX11's in d1, but differing ones for d2; a sample run produced (with inline comments added):

140067902240944   # First from d1
140067902240944   # Second from d1 is *same* object
140067900619760   # First from d2
140067900617712   # Second from d2 is unrelated object (same value, but stored separately)

While I didn't bother checking whether toml behaves the same way, given that it pickles the same as the yaml, it's clearly not attempting to dedupe strings; json is uniquely weird there. It's not a terrible idea that it does so, mind you; the keys of a JSON object are logically equivalent to attributes on an object, and for huge inputs (say, 10M objects in an array with the same handful of keys), deduping might save a meaningful amount of memory in the final parsed output (e.g. on CPython 3.11 x86-64 builds, a str like "ForwardX11" occupies 59 bytes, so replacing 10M copies with a single copy would cut the string data from roughly 590 MB to just 59 bytes).
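
As a rough check of that estimate, the sketch below parses a small array of objects sharing one key and counts how many distinct key objects result, then scales the per-string size up to 10M copies. The size n = 1000 is an arbitrary stand-in for the 10M case, and the per-string size reported by sys.getsizeof varies between CPython versions and builds, so no fixed number is assumed:

```python
import json
import sys

n = 1000  # small stand-in for the 10M objects mentioned above
text = json.dumps([{"ForwardX11": True} for _ in range(n)])
records = json.loads(text)

# All records are still alive here, so equal ids mean the very same object.
distinct_key_objects = len({id(key) for record in records for key in record})
print(distinct_key_objects)  # 1 when the parser dedupes keys within one run

# Size of one copy of the key, and what 10M separate copies would cost.
per_copy = sys.getsizeof("ForwardX11")
estimate = per_copy * 10_000_000
print(per_copy, estimate)
```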

As a side-note: This "dicts are equal, pickles are not" issue could also occur:

- When the two dicts were constructed with the same keys and values, but the order in which the keys were inserted differed (modern Python uses insertion-ordered dicts; comparisons between them ignore ordering, but pickle would be serializing them in whatever order they iterate in naturally).
- When there are objects which compare equal but have different types (e.g. set vs. frozenset, int vs. float); pickle would treat them separately, but equality tests would not see a difference.

Neither of these is the issue here (both json and yaml appear to be constructing in the same order seen in the input, and they're parsing the ints as ints), but it's entirely possible for your test of equality to return True, while the pickled forms are unequal, even when all the objects involved are unique.
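
Both side-note cases are easy to reproduce with hand-built dicts. This is only an illustrative sketch with made-up keys and values:

```python
import pickle

# Case 1: same keys and values, different insertion order.
first = {"a": 1, "b": 2}
second = {"b": 2, "a": 1}
print(first == second)                              # True
print(pickle.dumps(first) == pickle.dumps(second))  # False

# Case 2: values that compare equal but have different types.
as_int = {"x": 1}
as_float = {"x": 1.0}
print(as_int == as_float)                              # True
print(pickle.dumps(as_int) == pickle.dumps(as_float))  # False
```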

CodePudding user response：

Have you evaluated the dtypes being created? Possibly there is a mixture of str and int dtypes. Good luck!