How to shuffle a big JSON file?

I have a JSON file with 1,000,000 entries in it (size: 405 MB). It looks like this:

[
  {
     "orderkey": 1,
     "name": "John",
     "age": 23,
     "email": "[email protected]"
  },
  {
     "orderkey": 2,
     "name": "Mark",
     "age": 33,
     "email": "[email protected]"
  },
...
]

The data is sorted by "orderkey", and I need to shuffle it.

I tried the following Python code. It worked for a smaller JSON file, but not for my 405 MB one.

import json
import random

with open("sorted.json") as f:
     data = json.load(f)

random.shuffle(data)

with open("sorted.json", "w") as f:  # open in write mode so json.dump can write
     json.dump(data, f, indent=2)

How to do it?

UPDATE:

Initially I got the following error:

~/Desktop/shuffleData$ python3 toShuffle.py 
Traceback (most recent call last):
  File "/home/andrei/Desktop/shuffleData/toShuffle.py", line 5, in <module>
    data = json.load(f)
  File "/usr/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 403646259 (char 403646258)

I figured out that the problem was an extra "}" at the end of the JSON file. The file had the form [{...},{...}]}, which is not valid JSON.

Removing the trailing "}" fixed the problem.

CodePudding user response：

The problem was an extra "}" at the end of the JSON file: [{...},{...}]} is not valid JSON.

After removing the trailing "}", the Python code from the question works.
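
To locate this kind of stray trailing text without opening a 405 MB file in an editor, the exception itself can help. The sketch below is only an illustration: the string `broken` is a made-up miniature of the file, not the real data. For an "Extra data" error, `err.pos` is the character offset where the valid document ended, which is the same number as the "(char 403646258)" in the traceback.

```python
import json

# Hypothetical miniature of the broken file: a valid array plus a stray "}".
broken = '[{"orderkey": 1}, {"orderkey": 2}]}'

try:
    data = json.loads(broken)
    trailing = ""
except json.JSONDecodeError as err:
    # For "Extra data", err.pos is the offset where the valid document ended.
    trailing = broken[err.pos:]
    print(f"{err.msg} at char {err.pos}; leftover text: {trailing!r}")
    # Parse only the valid prefix; inspect `trailing` before trusting this.
    data = json.loads(broken[:err.pos])
```

Parsing the prefix is only reasonable when the leftover text is clearly junk, as it was here. For other kinds of JSONDecodeError the offset marks where parsing failed, and the prefix will usually not be valid JSON.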

CodePudding user response：

This should work unless you have memory constraints:

import random
random.shuffle(data)
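
If memory does become a constraint, one possible workaround (a sketch under assumptions, not something tested on the 405 MB file) is to avoid keeping a million dicts alive. The hypothetical helper `shuffle_array_text` below walks the array with `json.JSONDecoder.raw_decode`, keeps only the start and end offsets of each element, shuffles those offsets, and writes the original text slices back out. It assumes the whole file text fits in memory and that the top-level value is a well-formed array. Whether it actually uses less memory than `json.load` has not been measured here.

```python
import io
import json
import random

def shuffle_array_text(text, out, rng=random):
    """Write the elements of the JSON array in `text` to `out` in random order."""
    decoder = json.JSONDecoder()
    spans = []                       # (start, end) offsets of each element
    idx = text.index("[") + 1
    while True:
        # Skip whitespace and the commas between elements (lenient on purpose).
        while text[idx] in " \t\r\n,":
            idx += 1
        if text[idx] == "]":
            break
        _, end = decoder.raw_decode(text, idx)   # parsed value is discarded
        spans.append((idx, end))
        idx = end
    rng.shuffle(spans)
    out.write("[\n")
    for n, (start, end) in enumerate(spans):
        out.write(",\n" if n else "")
        out.write(text[start:end])
    out.write("\n]\n")

# Small made-up sample standing in for the real file contents.
sample = '[{"orderkey": 1}, {"orderkey": 2}, {"orderkey": 3}]'
buf = io.StringIO()
shuffle_array_text(sample, buf)
result = json.loads(buf.getvalue())
```

With a real file, `text` would come from `f.read()` and `out` would be a second file opened with mode "w". Each element keeps its original formatting, so the indentation inside the records is unchanged.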

If you are looking for another way and would like to benchmark which is faster on a large data set, you can use the shuffle function from scikit-learn. Unlike random.shuffle, it returns a shuffled copy instead of shuffling in place.

from sklearn.utils import shuffle

shuffled_data = shuffle(data)
print(shuffled_data)

Note: this requires installing the additional scikit-learn package (pip install -U scikit-learn).