Adding a comma between JSON objects in a datafile with Python?-CodePudding

I have large file (about 3GB) which contains what looks like a JSON file but isn't because it lacks commas (,) between "observations" or JSON objects (I have about 2 million of these "objects" in my data file).

For example, this is what I have:

{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "usernameid": "47284592942",
    "username": "Alex",
    "server": "475774810304151552",
    "text": "Must watch",
    "type": "462050823720009729",
    "datetime": "2018-08-05T21:20:20.486000 00:00",
    "type": {
        "$numberLong": "0"
    }
}

{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "273342",
    "usernameid": "Alice",
    "server": "475774810304151552",
    "text": "https://www.youtube.com/",
    "type": "4620508237200097wd29",
    "datetime": "2018-08-05T21:20:11.803000 00:00",
    "type": {
        "$numberLong": "0"
    }

And this is what I want (the comma between "observations"):

{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "username": "Alex",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }
},

{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "Alice",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }

This is what I tried but it doesn't give me a comma where I need it:

import re

with open('dataframe.txt', 'r') as input, open('out.txt', 'w') as output:
    output.write("[")
    for line in input:
        line = re.sub('', '},{', line)
        output.write('    ' line)
    output.write("]")

What can I do so that I can add a comma between each JSON object in my datafile?

CodePudding user response：

This solution presupposes that none of the fields in JSON contains neither { nor }.

If we assume that there is at least one blank line between JSON dictionaries, an idea: let's maintain unclosed curly brackets count ({) as unclosed_count; and if we meet an empty line, we add the coma once.

Like this:

with open('test.json', 'r') as input_f, open('out.json', 'w') as output_f:
    output_f.write("[")
    unclosed_count = 0
    comma_after_zero_added = True
    for line in input_f:
        unclosed_count_change = line.count('{') - line.count('}')
        unclosed_count  = unclosed_count_change
        if unclosed_count_change != 0:
            comma_after_zero_added = False
        if line.strip() == '' and unclosed_count == 0 and not comma_after_zero_added:
            output_f.write(",\n")
            comma_after_zero_added = True
        else:
            output_f.write(line)
    output_f.write("]")

CodePudding user response：

If you're sure that the only place you will find a blank line is between two dicts, then you can go ahead with your current idea, after you fix its execution. For every line, check if it's empty. If it isn't, write it as-is. If it is, write a comma instead

with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    output_file.write("[")
    for line in input_file:
        if line.strip():
            output_file.write(line)
        else:
            output_file.write(",")
    output_file.write("]")

If you cannot guarantee that any blank line must be replaced by a comma, you need a different approach. You want to replace a close-bracket, followed by an empty line (or multiple whitespace), followed by an open-bracket, with },{.

You can keep track of the previous two lines in addition to the current line, and if these are "}", "", and "{" in that order, then write a comma before writing the "{".

from collections import deque

with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    last_two_lines = deque(maxlen=2)
    output_file.write("[")
    for line in input_file:
        line_s = line.strip()
        if line_s == "{" and list(last_two_lines) == ["}", ""]:
            output_file.write(","   line)
        else:
            output_file.write(line)
        last_two_lines.append(line_s)

Alternatively, if you want to stick with regex, then you could do

with open('dataframe.txt') as input_file:
    file_contents = input_file.read()

repl_contents = re.sub(r'\}(\s )\{', r'},\1{', file_contents)

with open('out.txt', 'w') as output_file:
    output_file.write(repl_contents)

Here, the regex r"\}(\s )\{" matches the pattern we're looking for (\s matches multiple whitespace characters, and captures them in group 1, which we then use in the replacement string as \1.

Note that you will need to read and run re.sub on the entire file, which will be slow.

CodePudding user response：

Another way to view your data is that you have multiple json records separated by whitespace. You can use the stdlib JSONDecoder to read each record, then strip whitespace and repeat til done. The decoder reads a record from a string and tells you have far it got. Apply that iteratively to the data until all is consumed. This is far less risky than making a bunch of assumptions about what data is contained in the json itself.

import json

def json_record_reader(filename):
    with open(filename, encoding="utf-8") as f:
        txt = f.read().lstrip()
    decoder = json.JSONDecoder()
    result = []
    while txt:
        data, pos = decoder.raw_decode(txt)
        result.append(data)
        txt = txt[pos:].lstrip()
    return result
    
print(json_record_reader("data.json"))

Considering the size of your file, a memory mapped text file may be the better option.

CodePudding user response：

Assuming sufficient memory, you can parse such a stream one object at a time using json.JSONDecoder.raw_decode directly, instead of using json.loads.

>>> x = '{"a": 1}\n{"b": 2}\n'  # Hypothetical output of open("dataframe.txt").read()
>>> decoder = json.JSONDecoder()
>>> x = '{"a": 1}\n{"b":2}\n'
>>> decoder.raw_decode(x)
({'a': 1}, 8)
>>> decoder.raw_decode(x, 9)
({'b': 2}, 16)

The output of raw_decode is a tuple containing the first JSON value decoded and the position in the string where the remaining data starts. (Note that json.loads just creates an instance of JSONDecoder, and calls the decode method, which just calls raw_decode and artificially raises an exception if the entire input isn't consumed by the first decoded value.)

A little extra work is involved; note that you can't start decoding with whitespace, so you'll have to use the returned index to detect where the next value starts, following any additional whitespace at the returned index.