I'm working on an aggregated config file parsing tool, hoping it can support .json
, .yaml
and .toml
files. So, I have done the next tests:
The example.json
config file is as:
{
"DEFAULT":
{
"ServerAliveInterval": 45,
"Compression": true,
"CompressionLevel": 9,
"ForwardX11": true
},
"bitbucket.org":
{
"User": "hg"
},
"topsecret.server.com":
{
"Port": 50022,
"ForwardX11": false
},
"special":
{
"path":"C:\\Users",
"escaped1":"\n\t",
"escaped2":"\\n\\t"
}
}
The example.yaml
config file is as:
DEFAULT:
ServerAliveInterval: 45
Compression: yes
CompressionLevel: 9
ForwardX11: yes
bitbucket.org:
User: hg
topsecret.server.com:
Port: 50022
ForwardX11: no
special:
path: C:\Users
escaped1: "\n\t"
escaped2: \n\t
and the example.toml
config file is as:
[DEFAULT]
ServerAliveInterval = 45
Compression = true
CompressionLevel = 9
ForwardX11 = true
['bitbucket.org']
User = 'hg'
['topsecret.server.com']
Port = 50022
ForwardX11 = false
[special]
path = 'C:\Users'
escaped1 = "\n\t"
escaped2 = '\n\t'
Then, the test code with output is as:
import pickle,json,yaml
# TOML, see https://github.com/hukkin/tomli
try:
import tomllib
except ModuleNotFoundError:
import tomli as tomllib
path = "example.json"
with open(path) as file:
config1 = json.load(file)
assert isinstance(config1,dict)
pickled1 = pickle.dumps(config1)
path = "example.yaml"
with open(path, 'r', encoding='utf-8') as file:
config2 = yaml.safe_load(file)
assert isinstance(config2,dict)
pickled2 = pickle.dumps(config2)
path = "example.toml"
with open(path, 'rb') as file:
config3 = tomllib.load(file)
assert isinstance(config3,dict)
pickled3 = pickle.dumps(config3)
print(config1==config2) # True
print(config2==config3) # True
print(pickled1==pickled2) # False
print(pickled2==pickled3) # True
So, my question is, since the parsed obj are all dicts, and these dicts are equal to each other, why their pickled
codes are not the same, i.e., why is the pickled
code of the dict parsed from json
different to other two?
Thanks in advance.
CodePudding user response:
The difference is due to:
The
json
module is performing memoizing for object attributes with the same value (it's not interning them, but the scanner object contains amemo
dict
that it uses to dedupe identical attribute strings within a single parsing run), whileyaml
does not (it just makes a newstr
each time it sees the same data), andpickle
faithfully reproducing the exact structure of the data it's told to dump, replacing subsequent references to the same object with a back-reference to the first time it was seen (among other reasons, this makes it possible to dump recursive data structures, e.g.lst = []
,lst.append(lst)
, without infinite recursion, and reproduce them faithfully when unpickled)
Issue #1 isn't visible in equality testing (str
s compare equal with the same data, not just the same exact object in memory). But when pickle
sees "ForwardX11"
the first time, it inserts the pickled form of the object and emits a pickle opcode that assigns a number to that object. If that exact object is seen again (same memory address, not merely same value), instead of reserializing it, it just emits a simpler opcode that just says "Go find the object associated with the number from last time and put it here as well". If it's a different object though, even one with the same value, it's new, and gets serialized separately (and assigned another number in case the new object is seen again).
Simplifying your code to demonstrate the issue, you can inspect the generated pickle output to see how this is happening:
s = r'''{
"DEFAULT":
{
"ForwardX11": true
},
"FOO":
{
"ForwardX11": false
}
}'''
s2 = r'''DEFAULT:
ForwardX11: yes
FOO:
ForwardX11: no
'''
import io, json, yaml, pickle, pickletools
d1 = json.load(io.StringIO(s))
d2 = yaml.safe_load(io.StringIO(s2))
pickletools.dis(pickle.dumps(d1))
pickletools.dis(pickle.dumps(d2))
The output from that code for the json
parsed input is (with #
comments inline to point out important things), at least on Python 3.7 (the default pickle protocol and exact pickling format can change from release to release), is:
0: \x80 PROTO 3
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: X BINUNICODE 'DEFAULT'
18: q BINPUT 1
20: } EMPTY_DICT
21: q BINPUT 2
23: X BINUNICODE 'ForwardX11' # Serializes 'ForwardX11'
38: q BINPUT 3 # Assigns the serialized form the ID of 3
40: \x88 NEWTRUE
41: s SETITEM
42: X BINUNICODE 'FOO'
50: q BINPUT 4
52: } EMPTY_DICT
53: q BINPUT 5
55: h BINGET 3 # Looks up whatever object was assigned the ID of 3
57: \x89 NEWFALSE
58: s SETITEM
59: u SETITEMS (MARK at 5)
60: . STOP
highest protocol among opcodes = 2
while the output from the yaml
loaded data is:
0: \x80 PROTO 3
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: X BINUNICODE 'DEFAULT'
18: q BINPUT 1
20: } EMPTY_DICT
21: q BINPUT 2
23: X BINUNICODE 'ForwardX11' # Serializes as before
38: q BINPUT 3 # and assigns code 3 as before
40: \x88 NEWTRUE
41: s SETITEM
42: X BINUNICODE 'FOO'
50: q BINPUT 4
52: } EMPTY_DICT
53: q BINPUT 5
55: X BINUNICODE 'ForwardX11' # Doesn't see this 'ForwardX11' as being the exact same object, so reserializes
70: q BINPUT 6 # and marks again, in case this copy is seen again
72: \x89 NEWFALSE
73: s SETITEM
74: u SETITEMS (MARK at 5)
75: . STOP
highest protocol among opcodes = 2
print
ing the id
of each such string would get you similar information, e.g., replacing the pickletools
lines with:
for k in d1['DEFAULT']:
print(id(k))
for k in d1['FOO']:
print(id(k))
for k in d2['DEFAULT']:
print(id(k))
for k in d2['FOO']:
print(id(k))
will show a consistent id
for both 'ForwardX11'
s in d1
, but differing ones for d2
; a sample run produced (with inline comments added):
140067902240944 # First from d1
140067902240944 # Second from d1 is *same* object
140067900619760 # First from d2
140067900617712 # Second from d2 is unrelated object (same value, but stored separately)
While I didn't bother checking if toml
behaved the same way, given that it pickles the same as the yaml
, it's clearly not attempting to dedupe strings; json
is uniquely weird there. It's not a terrible idea that it does so mind you; the keys of a JSON dict
are logically equivalent to attributes on an object, and for huge inputs (say, 10M objects in an array with the same handful of keys), it might save a meaningful amount of memory on the final parsed output by deduping (e.g. on CPython 3.11 x86-64 builds, replacing 10M copies of "ForwardX11"
with a single copy would reduce 59 MB for string data to just 59 bytes).
As a side-note: This "dict
s are equal, pickles are not" issue could also occur:
- When the two
dict
s were constructed with the same keys and values, but the order in which the keys were inserted differed (modern Python uses insertion-ordereddict
s; comparisons between them ignore ordering, butpickle
would be serializing them in whatever order they iterate in naturally). - When there are objects which compare equal but have different types (e.g.
set
vs.frozenset
,int
vs.float
);pickle
would treat them separately, but equality tests would not see a difference.
Neither of these is the issue here (both json
and yaml
appear to be constructing in the same order seen in the input, and they're parsing the int
s as int
s), but it's entirely possible for your test of equality to return True
, while the pickled forms are unequal, even when all the objects involved are unique.
CodePudding user response:
Have you evaluated the dtypes being created? Possibly there is a mixture str and int dtypes. Good luck!