AWS Sagemaker output how to read file with multiple json objects spread out over multiple lines

Time:10-31

I have a bunch of json files that look like this

{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664, ], "word": "blah blah blah"}
{"vector": [0.01027186680585146, 0.04181386157870293, -0.07363887131214142, ], "word": "blah blah blah"}
{"vector": [0.011699287220835686, 0.04741542786359787, -0.07899319380521774, ], "word": "blah blah blah"}

Which I can read in with

import json

data = []
with open(file_name) as f:
    for line in f:
        data.append(json.loads(line))

But I have another file with output like this

{
    "predictions": [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]
    ]
}
{
    "predictions": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]
    ]
}
{
    "predictions": [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]
    ]
}

I.e. each JSON object is formatted over several lines, so I can't simply read the file in line by line. Is there an easy way to parse this? Or do I have to write something that stitches together each JSON object line by line and then does json.loads?

CodePudding user response:

Hmm, as far as I know there's unfortunately no way to load this pretty-printed, concatenated format with a single json.loads call. One option, though, is to write a helper function that converts it to a valid JSON string, as below:

import json

string = """
{
    "predictions": [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]
    ]
}
{
    "predictions": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]
    ]
}
{
    "predictions": [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]
    ]
}
"""


def json_lines_to_json(s: str) -> str:
    # replace the first occurrence of '{'
    s = s.replace('{', '[{', 1)

    # replace the last occurrence of '}'
    s = s.rsplit('}', 1)[0] + '}]'

    # now go in and replace all occurrences of '}' immediately followed
    # by newline with a '},'
    s = s.replace('}\n', '},\n')

    return s


print(json.loads(json_lines_to_json(string)))

Prints:

[{'predictions': [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]]}, {'predictions': [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]}, {'predictions': [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]]}]
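For what it's worth, the standard library can also handle concatenated JSON documents without any string surgery: json.JSONDecoder.raw_decode parses one document starting at a given index and reports where it ended, so you can walk through the string object by object. A minimal sketch (the string below is a shortened version of the example data):

```python
import json

string = """
{
    "predictions": [[0.875780046, 0.124219939]
    ]
}
{
    "predictions": [[0.0, 1.0]
    ]
}
"""


def iter_json_objects(s: str):
    """Yield each top-level JSON document from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    s = s.strip()
    idx = 0
    while idx < len(s):
        # raw_decode returns the parsed object and the index just past it
        obj, end = decoder.raw_decode(s, idx)
        yield obj
        # skip any whitespace separating the documents
        while end < len(s) and s[end].isspace():
            end += 1
        idx = end


print(list(iter_json_objects(string)))
```

Note that raw_decode requires the document to start exactly at the given index, which is why the sketch strips leading whitespace and skips the whitespace between documents.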

Note: your first example actually doesn't seem to be valid JSON (or at least valid JSON Lines, from my understanding). In particular, this part appears to be invalid due to a trailing comma after the last array element:

{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664, ], ...}

To ensure it's valid after calling the helper function, you'd also need to remove the trailing commas, so each line is in the below format:

{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664 ], ...},
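One way to strip those trailing commas programmatically is a regex substitution that removes a comma sitting directly before a closing bracket. A sketch, with the caveat that a simple regex like this would also rewrite such a comma inside a string value (fine for numeric data like the vectors here):

```python
import json
import re

line = ('{"vector": [0.017906909808516502, 0.052080217748880386, '
        '-0.1460590809583664, ], "word": "blah blah blah"}')

# drop a comma (plus optional whitespace) that immediately precedes ']' or '}'
cleaned = re.sub(r',\s*([\]}])', r'\1', line)

print(json.loads(cleaned))
```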

There also appears to be a similar question where they suggest splitting on newlines and calling json.loads on each line. In fact, it should be (slightly) less performant to call json.loads once per object than to call it once on the whole list, as I show below.

from timeit import timeit
import json


string = """\
{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664 ], "word": "blah blah blah"}
{"vector": [0.01027186680585146, 0.04181386157870293, -0.07363887131214142 ], "word": "blah blah blah"}
{"vector": [0.011699287220835686, 0.04741542786359787, -0.07899319380521774 ], "word": "blah blah blah"}\
"""


def json_lines_to_json(s: str) -> str:

    # Strip newlines from end, then replace all occurrences of '}' followed
    # by a newline, by a '},' followed by a newline.
    s = s.rstrip('\n').replace('}\n', '},\n')

    # return string value wrapped in brackets (list)
    return f'[{s}]'


n = 10_000

print('string replace:        ', timeit(r'json.loads(json_lines_to_json(string))', number=n, globals=globals()))
print('json.loads each line:  ', timeit(r'[json.loads(line) for line in string.split("\n")]', number=n, globals=globals()))

Result:

string replace:         0.07599360000000001
json.loads each line:   0.1078384