Reading certain lines from a file in python without loading it in memory


I have a large CSV file (90,000 rows, 20,000 columns). I want to read certain rows from it in Python using an index list, but I don't want to load the entire file with something like "read_csv", as that will fill up my memory and crash the program. Is there any way to do this without loading the entire file?

I am open to changing the format of the input file (.csv) if that will help my case.

CodePudding user response:

You can iterate over the lines in a file without loading them all into memory at once using the standard open function.

with open(my_file) as f:
    for line in f:
        ...

If you have a collection of line numbers you want to keep, you can capture just those lines with the following.

lines_to_keep = {...}
with open(my_file) as f:
    lines = [
        line for line_num, line in enumerate(f)
        if line_num in lines_to_keep
    ]

This uses a list comprehension to iterate over the lines and their indices (thanks to enumerate), keeping only those lines whose index appears in the set (or other collection) lines_to_keep.

If you have a lot of lines to keep, you might want this slightly more verbose method, which avoids the excess line number checking present in the first method.

lines = []
with open(my_file) as f:
    lines_iter = enumerate(f)
    for next_index in sorted(set(lines_to_keep)):
        for line_num, line in lines_iter:
            if line_num == next_index:
                lines.append(line)
                break

This method ensures uniqueness of the values in lines_to_keep and sorts them. Then, rather than checking line number membership against the whole collection, we can check only the next index in our sorted set.

In either case, you are left with a list of strings that you can pass to pandas.read_csv via an io.StringIO object.

from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO("".join(lines)))

Inspired by rdas's answer, I've added the following as perhaps the most compact way to accomplish your goal.

chunksize = 10_000
df = pd.concat([
    chunk.loc[chunk.index.intersection(lines_to_keep)]
    for chunk in pd.read_csv(my_file, chunksize=chunksize)
])

It is somewhat slower than both of the above methods in my testing (extracting 500 rows from a CSV file of 1,000 total rows) but is nicely readable.

CodePudding user response:

You can use pandas.read_csv with the chunksize argument to read the file into memory a piece at a time instead of all at once.

Pseudocode:

chunk_size = 10_000
reader = pd.read_csv('file.csv', chunksize=chunk_size)
for chunk in reader:
    # process chunk
    # if any of the indexes you want is present in this chunk, grab it

CodePudding user response:

index_list = [22, 5013, 101394]
output_rows = []

with open('large_file.csv') as f:
    for i, line in enumerate(f):
        # keep the line if its (0-based) line number is in the index list
        if i in index_list:
            output_rows.append(line)

print(output_rows)

CodePudding user response:

To read certain rows from a CSV file, use the skiprows argument of the pandas.read_csv function:

skiprows list-like, int or callable, optional

Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
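For example, here is a minimal sketch; the file name and the rows_to_keep set are placeholders for your own values, and line 0 is kept so the header survives:

import pandas as pd

rows_to_keep = {0, 22, 5013}  # 0-indexed line numbers to keep (0 is the header)

# the callable is evaluated for every line number and returns True
# for lines that should be skipped, so unwanted rows are never loaded
df = pd.read_csv(
    'large_file.csv',
    skiprows=lambda x: x not in rows_to_keep,
)

Pandas still scans every line of the file, but only the kept rows end up in memory.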
