Pytest: compare two JSON files


I have an API that creates a JSON file, like below:

"tesla_2.0": {
        "kind": "Auto",
        "tar_path": "/home/scripts/project_2/tesla_2.0.zip",
        "version": "2.0",
        "yaml_path": "/home/scripts/project_2/test.yaml",
        "name": "tesla"
    }

Since I'm reading it from a file, I use json.load(), which loses the order of the saved object unless I tell it to load into an OrderedDict().
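For reference, a minimal sketch of order-preserving loading ("data.json" is a placeholder file name, not from the original post):

import json
from collections import OrderedDict

# Load into an OrderedDict so key order from the file is preserved.
# ("data.json" is a hypothetical file name.)
with open("data.json", "r") as f:
    data = json.load(f, object_pairs_hook=OrderedDict)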

Is there a simple and efficient way to compare the two files?

import json
import os

def compare_json_files(file_1, file_2):
    if not os.path.isfile(file_1):
        raise FileNotFoundError("File not found: {}".format(file_1))
    if not os.path.isfile(file_2):
        raise FileNotFoundError("File not found: {}".format(file_2))
    with open(file_1, 'r') as f1:
        data_1 = json.load(f1)  # json.load() takes a file object; json.loads() is for strings
    with open(file_2, 'r') as f2:
        data_2 = json.load(f2)
    # comparison operation goes here
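In case it helps, one minimal comparison sketch (my assumption, not part of the original question): plain dicts compare by content, so key order does not matter for ==. Note that comparing two OrderedDicts with == is order-sensitive, so for an order-insensitive check, load both files into plain dicts.

import json

def json_files_equal(file_1, file_2):
    # Hypothetical helper (not from the original post): deep, order-insensitive
    # equality check, since plain dicts compare by content rather than key order.
    with open(file_1, 'r') as f1, open(file_2, 'r') as f2:
        return json.load(f1) == json.load(f2)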

Python version: 3.5.2

CodePudding user response:

I believe you could check every key and value. First check that the two dicts have the same set of keys; then a key-by-key comparison makes sense.

assert data_1.keys() == data_2.keys()
err_log = ['Err log:']
for k, v in data_1.items():
    try:
        assert v == data_2[k]
    except AssertionError:
        err_log.append('Error caught for key={}, data_1 value={}, data_2 value={}'.format(k, v, data_2[k]))
for e in err_log:
    print(e)

Edit 3: tested on very big dictionaries.

For very large dictionaries, the best results are obtained with itemgetter on a sorted list of keys.

Iterating through all keys of the dictionaries is the slowest; iterating with a sorted list of keys performs slightly better.

Results:

  • n = 100:
    • 4e-3 seconds
    • 3e-3 seconds
    • 1.5e-3 seconds
    • 1.9e-3 seconds
  • n = 1,000,000:
    • 8.9 seconds
    • 9.2 seconds
    • 7.2 seconds
    • 6.9 seconds
  • n = 10,000,000:
    • 143 seconds
    • 130 seconds
    • 115 seconds
    • 99 seconds
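The benchmark script for these timings is below; each timed function is run ten times inside the get_time decorator.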
from copy import deepcopy
from time import time
from operator import itemgetter

n = 10000000
v = {"stuff": "here", "and": "there"}
# Two dictionaries with identical contents but opposite insertion order
data_1 = {str(k): deepcopy(v) for k in range(0, n)}
data_2 = {str(k): deepcopy(v) for k in range(n-1, -1, -1)}

def get_time(f):
    # Decorator: run f ten times and return the total elapsed time
    def _(*args, **kwargs):
        t_0 = time()
        for x in range(10):
            f(*args, **kwargs)
        return time() - t_0
    return _

def with_dict_keys(d):
    return d.keys()

def with_sorted_dict_keys(d):
    return sorted(d.keys())

@get_time
def order_n_compare(key_func, d, d_):
    # Compare by iterating over every key, one lookup per key
    k_d, k_d_ = key_func(d), key_func(d_)
    assert k_d == k_d_
    for k in k_d:
        assert d[k] == d_[k]

@get_time
def itemgetter_compare(key_func, d, d_):
    # Compare by extracting all values with a single itemgetter call per dict
    k_d, k_d_ = key_func(d), key_func(d_)
    assert k_d == k_d_
    assert itemgetter(*k_d)(d) == itemgetter(*k_d)(d_)
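A hypothetical driver for the four timed variants (the original post does not show the calls; matching these to the four timings listed above is my assumption):

# Each call returns the elapsed time for 10 runs (see get_time).
print(order_n_compare(with_dict_keys, data_1, data_2))
print(order_n_compare(with_sorted_dict_keys, data_1, data_2))
print(itemgetter_compare(with_dict_keys, data_1, data_2))
print(itemgetter_compare(with_sorted_dict_keys, data_1, data_2))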
Edit 0: added a try/except block to print out where assertions fail.
Edit 1: fixed a minor bug.
Edit 2: checked computation time: the dict.keys() operation is negligible compared to iterating through all keys in data_1.items(), which grows as O(n), so it's not really necessary to optimize it.
  • Note: sorting dict.keys() is O(n log n), which dominates the cost of obtaining dict.keys() itself.