Parse a file which is a list of objects in Python

I have a JSON-like file in the format below. I would like to store the BLEU score in one list and the chrF2 score in another list.

The file format:

[
{
 "name": "BLEU",
 "score": 38.8,
 "signature": "nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "75.0/45.5/30.0/22.2 (BP = 1.000 ratio = 1.000 hyp_len = 12 ref_len = 12)",
 "nrefs": "1",
 "case": "lc",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
},
{
 "name": "chrF2  ",
 "score": 49.6,
 "signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0",
 "nrefs": "1",
 "case": "mixed",
 "eff": "yes",
 "nc": "6",
 "nw": "2",
 "space": "no",
 "version": "2.0.0"
}
]
[
{
 "name": "BLEU",
 "score": 19.2,
 "signature": "nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "61.5/33.3/18.2/5.0 (BP = 0.926 ratio = 0.929 hyp_len = 13 ref_len = 14)",
 "nrefs": "1",
 "case": "lc",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
},
{
 "name": "chrF2  ",
 "score": 38.8,
 "signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0",
 "nrefs": "1",
 "case": "mixed",
 "eff": "yes",
 "nc": "6",
 "nw": "2",
 "space": "no",
 "version": "2.0.0"
}
]
....

I tried:

import json
import sys

bleuScores = []
chrfScores = []

with open(sys.argv[1]) as f:
    for jsonObj in f:  # iterates over the file one line at a time
        list_of_scores = json.loads(jsonObj)
        print(list_of_scores)
        bleuScores.append(list_of_scores[0])
        chrfScores.append(list_of_scores[1])

but it did not work

CodePudding user response：

Your file does not seem to be valid JSON, so I would first reformat its text into valid JSON. After that, a simple for loop gives you the desired lists:

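import json
import sys

with open(sys.argv[1]) as f:
  text = f.read()

# Drop all square brackets, put a comma after every object, collapse the
# doubled commas, then wrap everything in a single pair of brackets.
text = text.replace("[", "").replace("]", "").replace("}", "},") \
  .replace("},,", "},").strip().strip(",")
text = "[" + text + "]"
myDictionary = json.loads(text)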

bleus = []
chrs = []
for value in myDictionary:
  if value["name"] == "BLEU":
    bleus.append(value)
  elif value["name"] == "chrF2  ":
    chrs.append(value)
print(bleus)
print(chrs)

Output

[{'name': 'BLEU', 'score': 38.8, 'signature': 'nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0', 'verbose_score': '75.0/45.5/30.0/22.2 (BP = 1.000 ratio = 1.000 hyp_len = 12 ref_len = 12)', 'nrefs': '1', 'case': 'lc', 'eff': 'no', 'tok': '13a', 'smooth': 'exp', 'version': '2.0.0'}, {'name': 'BLEU', 'score': 19.2, 'signature': 'nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0', 'verbose_score': '61.5/33.3/18.2/5.0 (BP = 0.926 ratio = 0.929 hyp_len = 13 ref_len = 14)', 'nrefs': '1', 'case': 'lc', 'eff': 'no', 'tok': '13a', 'smooth': 'exp', 'version': '2.0.0'}]
[{'name': 'chrF2  ', 'score': 49.6, 'signature': 'nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0', 'nrefs': '1', 'case': 'mixed', 'eff': 'yes', 'nc': '6', 'nw': '2', 'space': 'no', 'version': '2.0.0'}, {'name': 'chrF2  ', 'score': 38.8, 'signature': 'nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0', 'nrefs': '1', 'case': 'mixed', 'eff': 'yes', 'nc': '6', 'nw': '2', 'space': 'no', 'version': '2.0.0'}]
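
Since the question asks for the scores themselves rather than the whole objects, the last step can be sketched as follows. This is only a minimal sketch: it assumes the entries have already been parsed into a flat list of dicts like myDictionary, and the sample entries list and the scores_by_name helper are illustrative names, not part of the answer above.

```python
from collections import defaultdict

# Hypothetical sample standing in for the parsed, flattened file content.
entries = [
    {"name": "BLEU", "score": 38.8},
    {"name": "chrF2  ", "score": 49.6},
    {"name": "BLEU", "score": 19.2},
    {"name": "chrF2  ", "score": 38.8},
]


def scores_by_name(items):
    """Group the numeric 'score' values by metric name (whitespace-stripped)."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["name"].strip()].append(item["score"])
    return dict(grouped)


scores = scores_by_name(entries)
bleu_scores = scores.get("BLEU", [])
chrf_scores = scores.get("chrF2", [])
print(bleu_scores)
print(chrf_scores)
```

Stripping the name avoids depending on the trailing spaces in "chrF2  ". If the metric name in the real file is spelled differently, the key passed to scores.get has to change accordingly.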

CodePudding user response：

Your data is almost JSON, except that the file contains multiple lists one after another, with no structure around them.

Your format, abbreviated:

[
  {"some": "dict"}
]
[
  {"some": "dict"}
]

Valid JSON:

[
  [
    {"some": "dict"}
  ],
  [
    {"some": "dict"}
  ]
]

So, one approach is to add square brackets around the full content and to replace every closing square bracket that is followed by nothing but whitespace (including newlines) and then an opening square bracket with "],[".

Of course, a limitation of this approach is that a value like "oh ] [ no" would also be modified, so excluding anything in double quotes might be an added requirement, but that goes beyond the scope of your question.

A solution might look like:

import re
import json


def fix_content(s):
    # \s* allows any amount of whitespace (or none) between the two brackets
    s = re.sub(r'\]\s*\[', '],\n[', s)
    return f'[{s}]'


with open('mess.json') as f:
    data = json.loads(fix_content(f.read()))
    for some_list in data:
        for d in some_list:
            print(d)

Getting those 2 lists of scores:

    # Map each list to {name: score}, then transpose into two tuples.
    score_maps = [{d['name']: d['score'] for d in part} for part in data]
    BLEUs, chrF2s = zip(*((m['BLEU'], m['chrF2  ']) for m in score_maps))
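
An alternative that avoids editing the text at all, and so sidesteps the "oh ] [ no" limitation mentioned above, is to decode one JSON value at a time with json.JSONDecoder.raw_decode, which returns the parsed value together with the index where it ended. The sketch below is one possible way to do that, assuming the file contains nothing but whitespace between the lists. The iter_json_values helper and the inline sample string are illustrative names only.

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        if text[pos].isspace():
            pos += 1
            continue
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Hypothetical sample in the same shape as the file from the question.
sample = '''
[
 {"name": "BLEU", "score": 38.8, "note": "oh ] [ no"},
 {"name": "chrF2  ", "score": 49.6}
]
[
 {"name": "BLEU", "score": 19.2},
 {"name": "chrF2  ", "score": 38.8}
]
'''

bleu_scores, chrf_scores = [], []
for pair in iter_json_values(sample):
    for entry in pair:
        if entry["name"].startswith("BLEU"):
            bleu_scores.append(entry["score"])
        elif entry["name"].startswith("chrF"):
            chrf_scores.append(entry["score"])

print(bleu_scores)
print(chrf_scores)
```

Because the decoder does the parsing, brackets inside quoted strings are left untouched. Anything between the lists that is not whitespace would still raise json.JSONDecodeError.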