Splitting JSON file into smaller parts

Time:11-23

I am trying to split a large JSON file. I came across many posts here, but they do not solve my problem.

My JSON file has the following format, with one JSON array per line:

[input1]
[input2]
.
.
.
[inputk]

The input file is very large, so I am not sure I can post it here. Here are the first 2 lines of the file:

[{"type":"Module","children":[1,4,6,34,55]},{"type":"ImportFrom","children":[2],"value":"django.utils.translation"},{"type":"alias","children":[3],"value":"ugettext_lazy"},{"type":"identifier","value":"_"},{"type":"ImportFrom","children":[5],"value":"horizon"},{"type":"alias","value":"tabs"},{"type":"ClassDef","children":[7,11,33],"value":"NetworkProfileTab"},{"type":"bases","children":[8]},{"type":"AttributeLoad","children":[9,10]},{"type":"NameLoad","value":"tabs"},{"type":"attr","value":"Tab"},{"type":"body","children":[12,17,20,23]},{"type":"Assign","children":[13,14]},{"type":"NameStore","value":"name"},{"type":"Call","children":[15,16]},{"type":"NameLoad","value":"_"},{"type":"Str","value":"Network Profile"},{"type":"Assign","children":[18,19]},{"type":"NameStore","value":"slug"},{"type":"Str","value":"network_profile"},{"type":"Assign","children":[21,22]},{"type":"NameStore","value":"template_name"},{"type":"Str","value":"router/nexus1000v/network_profile/index.html"},{"type":"FunctionDef","children":[24,29,32],"value":"get_context_data"},{"type":"arguments","children":[25,28]},{"type":"args","children":[26,27]},{"type":"NameParam","value":"self"},{"type":"NameParam","value":"request"},{"type":"defaults"},{"type":"body","children":[30]},{"type":"Return","children":[31]},{"type":"NameLoad","value":"None"},{"type":"decorator_list"},{"type":"decorator_list"},{"type":"ClassDef","children":[35,39,54],"value":"PolicyProfileTab"},{"type":"bases","children":[36]},{"type":"AttributeLoad","children":[37,38]},{"type":"NameLoad","value":"tabs"},{"type":"attr","value":"Tab"},{"type":"body","children":[40,45,48,51]},{"type":"Assign","children":[41,42]},{"type":"NameStore","value":"name"},{"type":"Call","children":[43,44]},{"type":"NameLoad","value":"_"},{"type":"Str","value":"Policy 
Profile"},{"type":"Assign","children":[46,47]},{"type":"NameStore","value":"slug"},{"type":"Str","value":"policy_profile"},{"type":"Assign","children":[49,50]},{"type":"NameStore","value":"template_name"},{"type":"Str","value":"router/nexus1000v/policy_profile/index.html"},{"type":"Assign","children":[52,53]},{"type":"NameStore","value":"preload"},{"type":"NameLoad","value":"False"},{"type":"decorator_list"},{"type":"ClassDef","children":[56,60,69],"value":"IndexTabs"},{"type":"bases","children":[57]},{"type":"AttributeLoad","children":[58,59]},{"type":"NameLoad","value":"tabs"},{"type":"attr","value":"TabGroup"},{"type":"body","children":[61,64]},{"type":"Assign","children":[62,63]},{"type":"NameStore","value":"slug"},{"type":"Str","value":"indextabs"},{"type":"Assign","children":[65,66]},{"type":"NameStore","value":"tabs"},{"type":"TupleLoad","children":[67,68]},{"type":"NameLoad","value":"NetworkProfileTab"},{"type":"NameLoad","value":"PolicyProfileTab"},{"type":"decorator_list"}]

[{"type":"Module","children":[1,3,5,7,67,71,75]},{"type":"Expr","children":[2]},{"type":"Str","value":"Greenthread local storage of variables using weak references"},{"type":"Import","children":[4]},{"type":"alias","value":"weakref"},{"type":"ImportFrom","children":[6],"value":"eventlet"},{"type":"alias","value":"corolocal"},{"type":"ClassDef","children":[8,12,66],"value":"WeakLocal"},{"type":"bases","children":[9]},{"type":"AttributeLoad","children":[10,11]},{"type":"NameLoad","value":"corolocal"},{"type":"attr","value":"local"},{"type":"body","children":[13,40]},{"type":"FunctionDef","children":[14,19,39],"value":"__getattribute__"},{"type":"arguments","children":[15,18]},{"type":"args","children":[16,17]},{"type":"NameParam","value":"self"},{"type":"NameParam","value":"attr"},{"type":"defaults"},{"type":"body","children":[20,30,37]},{"type":"Assign","children":[21,22]},{"type":"NameStore","value":"rval"},{"type":"Call","children":[23,28,29]},{"type":"AttributeLoad","children":[24,27]},{"type":"AttributeLoad","children":[25,26]},{"type":"NameLoad","value":"corolocal"},{"type":"attr","value":"local"},{"type":"attr","value":"__getattribute__"},{"type":"NameLoad","value":"self"},{"type":"NameLoad","value":"attr"},{"type":"If","children":[31,32]},{"type":"NameLoad","value":"rval"},{"type":"body","children":[33]},{"type":"Assign","children":[34,35]},{"type":"NameStore","value":"rval"},{"type":"Call","children":[36]},{"type":"NameLoad","value":"rval"},{"type":"Return","children":[38]},{"type":"NameLoad","value":"rval"},{"type":"decorator_list"},{"type":"FunctionDef","children":[41,47,65],"value":"__setattr__"},{"type":"arguments","children":[42,46]},{"type":"args","children":[43,44,45]},{"type":"NameParam","value":"self"},{"type":"NameParam","value":"attr"},{"type":"NameParam","value":"value"},{"type":"defaults"},{"type":"body","children":[48,55]},{"type":"Assign","children":[49,50]},{"type":"NameStore","value":"value"},{"type":"Call","children":[51,54]},{"type":"Attrib
uteLoad","children":[52,53]},{"type":"NameLoad","value":"weakref"},{"type":"attr","value":"ref"},{"type":"NameLoad","value":"value"},{"type":"Return","children":[56]},{"type":"Call","children":[57,62,63,64]},{"type":"AttributeLoad","children":[58,61]},{"type":"AttributeLoad","children":[59,60]},{"type":"NameLoad","value":"corolocal"},{"type":"attr","value":"local"},{"type":"attr","value":"__setattr__"},{"type":"NameLoad","value":"self"},{"type":"NameLoad","value":"attr"},{"type":"NameLoad","value":"value"},{"type":"decorator_list"},{"type":"decorator_list"},{"type":"Assign","children":[68,69]},{"type":"NameStore","value":"store"},{"type":"Call","children":[70]},{"type":"NameLoad","value":"WeakLocal"},{"type":"Assign","children":[72,73]},{"type":"NameStore","value":"weak_store"},{"type":"Call","children":[74]},{"type":"NameLoad","value":"WeakLocal"},{"type":"Assign","children":[76,77]},{"type":"NameStore","value":"strong_store"},{"type":"AttributeLoad","children":[78,79]},{"type":"NameLoad","value":"corolocal"},{"type":"attr","value":"local"}]

I want to split a file like the sample above into several smaller files, so I ran the following code:

json_file = "path"

import os
import json
# you need to add your path here
with open(os.path.join(json_file, 'test.json'), 'r',
          encoding='utf-8') as f1:
    ll = [line for line in f1.readlines()]

    # this is the total number of lines in the json file
    print(len(ll))

    # in here 50000 means we are getting splits of 50000 lines
    # you can define your own size of split according to your need
    size_of_the_split=50000
    total = len(ll) // size_of_the_split

    # in here you will get the number of splits
    print(total + 1)

    for i in range(total + 1):
        json.dump(ll[i * size_of_the_split:(i + 1) * size_of_the_split], open(
            json_file + "\\split50k" + str(i + 1) + ".json", 'w',
            encoding='utf8'), ensure_ascii=False, indent=True)

Here is what I got (this is only a snapshot; please let me know if you also need the full output sample):

[image: snapshot of the output]

The problem is that the output no longer has the structure of the input file, where each line holds one array [input].

First, multiple lines were merged into a single array object, [[input1][input2]...], although I should have [input1][input2]....

Second, double quotation marks were added around the arrays, as the sample output above shows, along with an escape character \" before every string in each array, giving [\"type\":\"Module...] where I should have ["type":"Module"...] as in the input sample above.

I just want the original input structure, split across files, with nothing in the content changed. Can you please help me with that?

CodePudding user response:

Try using json.loads(line) when reading the file:

with open(os.path.join(json_file, 'test.json'), 'r',
          encoding='utf-8') as f1:
    ll = [json.loads(line) for line in f1.readlines()]
    # The rest
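
To see why parsing each line matters, here is a minimal sketch. The two short lines in raw_lines are made-up stand-ins for the real file, and the variable names are only illustrative. It compares dumping the raw line strings with dumping the parsed lines:

```python
import json

# Two short stand-in lines; the real file has one long array per line.
raw_lines = ['[{"type":"Module","children":[1]}]\n',
             '[{"type":"Str","value":"x"}]\n']

# Dumping the raw strings: each line is encoded as a JSON string,
# so it gains surrounding quotes and every inner quote is escaped.
as_strings = json.dumps(raw_lines)

# Parsing first: each line becomes a list of dicts, and dumping
# produces nested arrays with no escaping.
parsed = [json.loads(line) for line in raw_lines]
as_arrays = json.dumps(parsed)

print(as_strings)
print(as_arrays)
```

Note that parsing removes the escaping, but dumping the whole list in one call still wraps all the lines in one outer array, so the one-array-per-line layout is not restored by this change alone.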

CodePudding user response:

The original file as a whole is not valid JSON (it is a sequence of JSON arrays, one per line), while json.dump creates a file that is a single valid JSON document. My suggestion would be to convert the line items to JSON one at a time when writing to the file.
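
As a small illustration of that point, the sketch below checks a made-up two-line input both as one document and line by line. The helper is_valid_json and the sample text are hypothetical names used only for this example:

```python
import json

# Stand-in for the input: two JSON arrays, one per line.
text = '[{"type":"Module"}]\n[{"type":"Expr"}]\n'

def is_valid_json(s):
    """Return True if s parses as a single JSON document."""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False

# The file taken as one document is rejected (extra data after the
# first array), but every individual line parses on its own.
whole_ok = is_valid_json(text)
lines_ok = [is_valid_json(line) for line in text.splitlines()]
print(whole_ok, lines_ok)  # prints: False [True, True]
```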

Replace this:

for i in range(total + 1):
    json.dump(ll[i * size_of_the_split:(i + 1) * size_of_the_split], open(
        json_file + "\\split50k" + str(i + 1) + ".json", 'w',
        encoding='utf8'), ensure_ascii=False, indent=True)

with this:

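for i in range(len(ll)):
    # Start a new output file every size_of_the_split lines
    if i % size_of_the_split == 0:
        if i != 0:
            file.close()
        file = open(json_file + "\\split50k" + str(i + 1) + ".json", 'w')
    # Write one array per line; this assumes ll[i] is a parsed object
    # (for example from json.loads), not the raw line string
    file.write(json.dumps(ll[i]) + "\n")
file.close()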
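
If the goal is only to keep every line exactly as it is in the input, one alternative is to skip the json module and copy the raw lines in chunks. The following is a minimal sketch under that assumption, not the approach from the answers above. The function name split_lines, its parameters, and the output naming scheme are all hypothetical:

```python
import os
import tempfile
from itertools import islice

def split_lines(src_path, out_dir, lines_per_file, prefix="split"):
    """Copy src_path into numbered files of at most lines_per_file lines.

    Lines are written exactly as read, so each output line is still one
    JSON array. Returns the list of paths written.
    """
    written = []
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        part = 1
        while True:
            # Read the next chunk without loading the whole file.
            chunk = list(islice(src, lines_per_file))
            if not chunk:
                break
            out_path = os.path.join(out_dir, f"{prefix}{part}.json")
            with open(out_path, "w", encoding="utf-8", newline="") as out:
                out.writelines(chunk)
            written.append(out_path)
            part += 1
    return written

# Small demonstration with a temporary directory and 5 stand-in lines.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.json")
    with open(src, "w", encoding="utf-8", newline="") as f:
        for n in range(5):
            f.write('[{"type":"Module","children":[%d]}]\n' % n)
    paths = split_lines(src, tmp, lines_per_file=2)
    print([os.path.basename(p) for p in paths])
```

Because the lines are never parsed or re-serialized, no quotes, escapes, or extra brackets can be introduced, and only one chunk is held in memory at a time. The trade-off is that malformed lines are copied through without being detected.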