I have a JSON-like file in the format below. I would like to store the BLEU score in one list and the chrF2 score in another list.
The file format:
[
{
"name": "BLEU",
"score": 38.8,
"signature": "nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0",
"verbose_score": "75.0/45.5/30.0/22.2 (BP = 1.000 ratio = 1.000 hyp_len = 12 ref_len = 12)",
"nrefs": "1",
"case": "lc",
"eff": "no",
"tok": "13a",
"smooth": "exp",
"version": "2.0.0"
},
{
"name": "chrF2 ",
"score": 49.6,
"signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0",
"nrefs": "1",
"case": "mixed",
"eff": "yes",
"nc": "6",
"nw": "2",
"space": "no",
"version": "2.0.0"
}
]
[
{
"name": "BLEU",
"score": 19.2,
"signature": "nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0",
"verbose_score": "61.5/33.3/18.2/5.0 (BP = 0.926 ratio = 0.929 hyp_len = 13 ref_len = 14)",
"nrefs": "1",
"case": "lc",
"eff": "no",
"tok": "13a",
"smooth": "exp",
"version": "2.0.0"
},
{
"name": "chrF2 ",
"score": 38.8,
"signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0",
"nrefs": "1",
"case": "mixed",
"eff": "yes",
"nc": "6",
"nw": "2",
"space": "no",
"version": "2.0.0"
}
]
....
I tried:
with open(sys.argv[1]) as f:
    for jsonObj in f:
        list_of_scores = json.loads(jsonObj)
        print(list_of_scores)
        bleuScores.append(list_of_scores[0])
        chrfScores.append(list_of_scores[1])
but it did not work.
CodePudding user response:
Since your file is not valid JSON, I would first reformat its content into valid JSON. After that, a simple for loop gives you the desired lists:
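import json
import sys

with open(sys.argv[1]) as f:
    text = f.read()

# Drop all square brackets, make sure every "}" is followed by exactly one
# comma, then wrap everything in a single pair of square brackets.
text = text.replace("[", "").replace("]", "").replace("}", "},") \
    .replace("},,", "},").strip().strip(",")
text = "[" + text + "]"
myDictionary = json.loads(text)

bleus = []
chrs = []
for value in myDictionary:
    # Note the trailing space in "chrF2 " in the input file.
    if value["name"] == "BLEU":
        bleus.append(value)
    elif value["name"] == "chrF2 ":
        chrs.append(value)
print(bleus)
print(chrs)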
Output
[{'name': 'BLEU', 'score': 38.8, 'signature': 'nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0', 'verbose_score': '75.0/45.5/30.0/22.2 (BP = 1.000 ratio = 1.000 hyp_len = 12 ref_len = 12)', 'nrefs': '1', 'case': 'lc', 'eff': 'no', 'tok': '13a', 'smooth': 'exp', 'version': '2.0.0'}, {'name': 'BLEU', 'score': 19.2, 'signature': 'nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0', 'verbose_score': '61.5/33.3/18.2/5.0 (BP = 0.926 ratio = 0.929 hyp_len = 13 ref_len = 14)', 'nrefs': '1', 'case': 'lc', 'eff': 'no', 'tok': '13a', 'smooth': 'exp', 'version': '2.0.0'}]
[{'name': 'chrF2 ', 'score': 49.6, 'signature': 'nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0', 'nrefs': '1', 'case': 'mixed', 'eff': 'yes', 'nc': '6', 'nw': '2', 'space': 'no', 'version': '2.0.0'}, {'name': 'chrF2 ', 'score': 38.8, 'signature': 'nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0', 'nrefs': '1', 'case': 'mixed', 'eff': 'yes', 'nc': '6', 'nw': '2', 'space': 'no', 'version': '2.0.0'}]
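If you only need the numeric scores rather than the whole dictionaries, the same filtering can be written as two list comprehensions. This is a minimal sketch: the `entries` list is an abbreviated stand-in for the parsed data above, and the names `bleu_scores` and `chrf_scores` are my own choice.

```python
# Abbreviated sample in the same shape as the parsed dictionaries above.
entries = [
    {"name": "BLEU", "score": 38.8},
    {"name": "chrF2 ", "score": 49.6},
    {"name": "BLEU", "score": 19.2},
    {"name": "chrF2 ", "score": 38.8},
]

# strip() guards against the trailing space in "chrF2 ".
bleu_scores = [e["score"] for e in entries if e["name"].strip() == "BLEU"]
chrf_scores = [e["score"] for e in entries if e["name"].strip() == "chrF2"]

print(bleu_scores)  # [38.8, 19.2]
print(chrf_scores)  # [49.6, 38.8]
```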
CodePudding user response:
Your data format is almost JSON, except that the file appears to contain multiple top-level lists with no structure around them.
Your format, abbreviated:
[
{"some": "dict"}
]
[
{"some": "dict"}
]
Valid JSON:
[
[
{"some": "dict"}
],
[
{"some": "dict"}
]
]
So one approach is to wrap the full content in square brackets and to replace every closing square bracket that is followed by nothing but whitespace (including newlines) and then an opening square bracket with ],[
A limitation of this approach is that a string value like "oh ] [ no" would also be modified, so excluding anything inside double quotes might be an added requirement, but that goes beyond the scope of your question.
A solution might look like:
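import re
import json

def fix_content(s):
    # Insert a comma between a closing bracket and the next opening bracket,
    # allowing any amount of whitespace (including newlines) in between.
    s = re.sub(r'\]\s*\[', '],\n[', s)
    return f'[{s}]'

with open('mess.json') as f:
    data = json.loads(fix_content(f.read()))

for some_list in data:
    for d in some_list:
        print(d)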
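Getting those two lists of scores:
# Build a name -> score mapping for each part, then transpose the
# (BLEU, chrF2) pairs into two tuples of scores.
BLEUs, chrF2s = zip(*((d['BLEU'], d['chrF2 '])
                      for d in (dict((d['name'], d['score']) for d in part)
                                for part in data)))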
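If brackets inside string values are a concern, an alternative to the regex is to let the JSON parser itself find where each list ends, using `json.JSONDecoder.raw_decode`. The following is only a sketch under assumptions: `iter_json_values` is a helper name I made up, the sample text is abbreviated, and the `"note"` key is hypothetical, added just to show that "oh ] [ no" inside a string is left alone.

```python
import json

def iter_json_values(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        # Returns the parsed value and the index just past where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value

# Abbreviated sample in the same shape as the file from the question.
sample = '''
[
  {"name": "BLEU", "score": 38.8},
  {"name": "chrF2 ", "score": 49.6}
]
[
  {"name": "BLEU", "score": 19.2},
  {"name": "chrF2 ", "score": 38.8, "note": "oh ] [ no"}
]
'''

bleu_scores, chrf_scores = [], []
for part in iter_json_values(sample):
    for d in part:
        name = d["name"].strip()  # tolerate the trailing space in "chrF2 "
        if name == "BLEU":
            bleu_scores.append(d["score"])
        elif name == "chrF2":
            chrf_scores.append(d["score"])

print(bleu_scores)  # [38.8, 19.2]
print(chrf_scores)  # [49.6, 38.8]
```

For a real file, you would pass `f.read()` instead of `sample`. Unlike the `zip` one-liner, this gives you plain lists rather than tuples.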