I have two huge json files (20GB each) and I need to join them. The files have the following content:
file_1.json = [{"key": "value"}, {...}]
file_2.json = [{"key": "value"}, {...}]
The main problem, however, is that I need all dict to be in the same list. I tried to do this in python, but unfortunately, I don't have the memory to do this operation.
So, I thought maybe I could tackle this with unix commands, by replacing, in the first file, the ]
for ,
(note that there is a space after the comma) and erasing [
from the second file. Then, I would join the two files with the cat
unix command.
Is there a way for me to edit only the last 10 char in unix?
I tried to use echo
and tr
but I might be doing something wrong with the syntax.
CodePudding user response:
You can very easily append to a file in place, i.e. add characters at the end without rewriting the data that's already there. With the right tools (truncate
if your system has it), you can truncate a file in place, i.e. remove characters at the end without rewriting the data that's staying. With the right tools (dd
, if you're feeling adventurous), you can replace a part of a file by a string of the same length, without rewriting the unchanged parts. On the other hand, you can't remove characters from the beginning or middle of a file without rewriting the file (with a few exceptions that aren't relevant here).
But anyway rewriting both files in place wouldn't help you that much. You will need to at least rewrite the content of the second file to append it to the first file.
If you don't need to keep the split files around, you can append the second file to the first file in place, after taking care of the middle punctuation. Remove the last ]
character from the first file, as well as any following spaces and line breaks. Assuming that the first file ends in ]
and a newline and you have GNU core utilities (e.g. non-embedded Linux):
truncate -s -2 file_1.json
Now you can add a comma and optionally a line break to the first file, and append the data from the second file without its first character.
echo , >>file_1.json
tail -c 2 file_2.json >>file_1.json
If you want to keep the original files unmodified, you can make a copy of the first file and truncate it. Or you can directly make a truncated copy of the first file (still assuming GNU coreutils):
head -c -2 file_1.json >concatenated.json
echo , >>concatenated.json
tail -c 2 file_2.json >>concatenated.json
If you're more comfortable with Python, you can do all of this in Python. Just don't read the whole file in one go, i.e. don't call read()
or use readline()
in a way that reads all the lines as once. Instead, read and process a single line at a time (if the lines are short) or a single block of data. Untested code:
with open('concatenated.json', 'wb') as out:
with open('file_1.json', 'rb') as inp:
buf = bytes(1024)
size = inp.seek(-len(buf), io.SEEK_END)
n = inp.readinto(buf)
m = re.search(rb']\s*\Z', buf)
stop_at = m.start()
inp.seek(0, io.SEEK_SET)
n = inp.readinto(buf)
total = n
while n > 0:
out.write(buf)
n = inp.readinto(buf)
total = n
if total > stop_at:
out.write(buf[:len(buf)-(total-stop_at)])
n = 0
out.write(b',')
with open('file_2.json', 'rb') as inp:
buf = bytes(1024)
n = inp.readinto(buf)
assert buf[0] == b'['
buf[0:1] = b'\n'
while n > 0:
out.write(buf)
n = inp.readinto(buf)