I'd like to get the last line of a big gzipped log file without having to iterate over all the other lines.
I have read Print Last Line of File Read In with Python and in particular this answer for big files, but it does not work for gzipped files. Indeed, I tried:
import gzip
import os

with gzip.open(f, 'rb') as g:  # f is the path to the .gz file
    g.seek(-2, os.SEEK_END)
    while g.read(1) != b'\n':  # keep reading backward until the previous line break
        g.seek(-2, os.SEEK_CUR)
    print(g.readline().decode())
but it already takes more than 80 seconds for a 10 MB compressed / 130 MB decompressed file, on my very standard laptop!
Question: how to seek efficiently to the last line in a gzipped file, with Python?
Side-remark: if not gzipped, this method is very fast: 1 millisecond for a 130 MB file:
import os, time

t0 = time.time()
with open('test', 'rb') as g:
    g.seek(-2, os.SEEK_END)
    while g.read(1) != b'\n':
        g.seek(-2, os.SEEK_CUR)
    print(g.readline().decode())
print(time.time() - t0)
CodePudding user response:
If you have no control over the generation of the gzip file, then there is no way to read the last line of the uncompressed data without decoding all of the lines. The time it takes will be O(n), where n is the size of the file. There is no way to make it O(1).
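Since every byte has to be decompressed anyway in that case, a single forward pass is about the best one can do. Here is a minimal sketch, assuming a UTF-8 log; the helper name last_line_gz is hypothetical and not part of the gzip module. It is still O(n), but it avoids backward seeks and keeps only one line in memory:

```python
import gzip
from collections import deque


def last_line_gz(path, encoding="utf-8"):
    """Return the last line of a gzipped text file, or '' if it is empty.

    Hypothetical helper: decompresses the whole file once, front to back,
    but never holds more than one line in memory.
    """
    with gzip.open(path, "rb") as g:
        # deque(maxlen=1) consumes the line iterator and keeps only the final item.
        tail = deque(g, maxlen=1)
    return tail[0].decode(encoding) if tail else ""
```

Usage would be something like print(last_line_gz('test.gz')), where 'test.gz' stands for your own file.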
If you do have control over the compression end, then you can create a gzip file that facilitates random access, and you can also keep track of random access entry points to enable jumping to the end of the file.
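One way to sketch that idea, as an assumption rather than the only possible scheme: write the log as a series of independent gzip members (a concatenation of gzip streams is still a valid gzip file) and record the byte offset where each member starts. The function names and the in-memory index list below are hypothetical; a real setup would need to persist the index somewhere, and this sketch assumes every member is non-empty and ends on a line boundary.

```python
import gzip
import os


def append_member(path, lines, index):
    """Append `lines` (bytes objects, each ending in a newline) as a new gzip member.

    `index` is a plain list of byte offsets, one per member; persisting it
    (for example in a sidecar file) is left out of this sketch.
    """
    with open(path, "ab") as raw:
        offset = raw.seek(0, os.SEEK_END)  # start of the new member in the compressed file
        index.append(offset)
        raw.write(gzip.compress(b"".join(lines)))


def last_line_indexed(path, index):
    """Decompress only the last member and return its last line as bytes."""
    with open(path, "rb") as raw:
        raw.seek(index[-1])  # cheap seek: this is the compressed file, not the gzip stream
        data = gzip.decompress(raw.read())
    return data.splitlines(keepends=True)[-1]
```

The file remains readable as a whole with gzip.open, while the last line only costs the decompression of the final member. Smaller members make the lookup cheaper but usually compress a little worse.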
CodePudding user response:
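The slowness is probably due to the many calls to seek in the loop: on a gzip file, each backward seek has to rewind to the start and decompress again up to the target position.
So this solution, with only one seek, works: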
import gzip
import os

with gzip.open(f, 'rb') as g:
    g.seek(-1000, os.SEEK_END)  # go 1000 bytes before the end
    last_line = g.readlines()[-1].decode()  # the last line
Note that g.readlines() is fast here, because it only splits the last 1000 bytes into lines. Change 1000 according to the longest line that could occur in your files.
Still looking for a better solution. This is related but does not give a real solution to get the last line: Lazy Method for Reading Big File in Python?