Home > other >  Bisection search in the sorted lines of an opened file (not loaded in memory)
Bisection search in the sorted lines of an opened file (not loaded in memory)

Time:02-23

I have a text file of multiple gigabytes, and the millions of lines are sorted:

aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj

How is it possible, without loading the whole file in memory, to search if a line is existent, with a bisection search? (Possibly in O(log n) in the number of lines)

Is there a function like bisect.bisect_left among the lines of a file f = open('file.txt', 'r') in the Python library?

The window would initially be [a, b] = [0, file_size]. Then it would seek in the file at position m=(a b)/2, look for the next \n, and read the following line l. If the pattern to search is smaller or greater than l (with lexicographic order), then we continue on [m, b] or [a, m]. Before rolling my own, does this exist in Python?

CodePudding user response:

You can use the mmap built-in module. It provides random access to files (i.e., a file behaves like a large array of bytes stored in the file system). You can find more info here.

import mmap

def bisect_search(file_path, line):
    line = line.encode()
    with open(file_path, 'r b') as f:
        mm = mmap.mmap(f.fileno(), 0)
        lo = 0
        hi = mm.size()
        while lo < hi:
            mid = (lo   hi) // 2
            left_endl_idx = mm.rfind(b'\n', lo, mid)
            right_endl_idx = mm.find(b'\n', mid, hi)
            if left_endl_idx == -1:
                left_endl_idx = lo - 1
            if right_endl_idx == -1:
                right_endl_idx = hi
            mid_line = mm[left_endl_idx   1: right_endl_idx]
            if mid_line == line:
                return True
            if mid_line < line:
                lo = right_endl_idx   1
            else:
                hi = left_endl_idx
    return False

The function returns True if line exists within the file, False otherwise. Let's use the following myfile.txt file to run a few examples:

aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj
>>> bisect_search('myfile.txt', 'hello')
False
>>> bisect_search('myfile.txt', 'aaaaa')
True
>>> bisect_search('myfile.txt', 'aaaa')
False
>>> bisect_search('myfile.txt', 'dfsdf')
True
>>> bisect_search('myfile.txt', 'zzdiszfhjj')
False

This function should be much faster than a linear search on a big file.

Note: this code works with \n endings, and not currently with \r\n Windows-style endings (not necessary for OP).

  • Related