I have a large CSV file (~50 GB on disk), and am unable to read this entirely into memory at once. The dataset itself is panel data and looks something like
ID Time Col 1 ... Col N
1 2000/1/1 ...
1 2000/1/2
...
2 2000/1/1 ...
...
My idea is to read the data in chunks, do some pre-processing to reduce its size, and then save each chunk individually. I am aware of pd.read_csv(..., chunksize=1000),
which lets me loop over chunks of 1000 rows, but for the pre-processing to be accurate, I would prefer to loop over chunks defined by the ID column. (All the rows corresponding to a specific ID
are required for accurate pre-processing.)
In other words, let's say I have a smaller file that contains all the ID
values (say, 1-1000). Then, I want to do something like
list_of_id_chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], ...]  # Split the total IDs into chunks of 3 IDs each
for chunk_of_ids in list_of_id_chunks:
    # 1. Read the large CSV file, keeping only the rows where `ID` is in chunk_of_ids
    #    (for the first iteration, this should have rows with ID = 1, 2, or 3)
    # 2. Do some preprocessing to trim file size
    # 3. Save the result as csv, feather, etc.
Any suggestions?
CodePudding user response:
You could start with something like this, it reads your file 1 million lines at a time, breaks each chunk up by ID, and saves by ID to new files. In the end, you'd have a separate file for each ID.
import pandas as pd

with pd.read_csv('big_file.csv', chunksize=1_000_000) as reader:
    for chunk in reader:
        for name, group in chunk.groupby('ID'):
            # Append (mode='a'), since rows for one ID may span multiple chunks;
            # header=False means the per-ID files have no header row.
            group.to_csv(f'big_file_id_{name}.csv', mode='a', index=False, header=False)
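Because the per-ID files are written with header=False, pass header=None (with your own column names) when reading them back. A minimal sketch of the follow-up step, regrouping those files into chunks of IDs for preprocessing, might look like this (the file names, the tiny stand-in files, and the chunks helper are illustrative):

```python
import pandas as pd

def chunks(seq, size):
    """Yield successive slices of `seq` of length `size`."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Tiny stand-in for the per-ID files so the example runs
# (two rows per ID, no header, matching the splitter above).
for i in (1, 2, 3):
    pd.DataFrame({'ID': [i, i], 'value': [i * 10, i * 10 + 1]}).to_csv(
        f'big_file_id_{i}.csv', index=False, header=False)

cols = ['ID', 'value']  # your real column names go here
for chunk_of_ids in chunks([1, 2, 3], 3):
    df = pd.concat(
        [pd.read_csv(f'big_file_id_{i}.csv', header=None, names=cols)
         for i in chunk_of_ids],
        ignore_index=True)
    # ... preprocess df here, then save as csv, feather, etc. ...
    print(len(df))  # 6 rows: two per ID, three IDs in this chunk
```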
CodePudding user response:
If you know the row indices of the chunk edges (in the example below, the edges would be [0, 4, 7, 10]),
then you can make use of skiprows and nrows in pandas' read_csv to achieve this.
Here is a sample snippet - hope it helps.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df.to_csv("test.csv", index=None)

indices_to_split = [0, 4, 7, 10]  # these are the edges of your ids (counting the header line)

for idx, (start_index, end_index) in enumerate(
        zip(indices_to_split[:-1], indices_to_split[1:])):
    if idx == 0:
        # First chunk includes the header line, so read one row fewer
        df = pd.read_csv(
            "test.csv",
            nrows=end_index - start_index - 1,
            skiprows=start_index,
        )
        cols = df.columns
    else:
        # Later chunks have no header; reuse the column names from the first
        df = pd.read_csv(
            "test.csv",
            nrows=end_index - start_index,
            skiprows=start_index,
            header=None,
            names=cols,
        )
    print(df, end="\n" * 3)
a
0 1
1 2
2 3
a
0 4
1 5
2 6
a
0 7
1 8
2 9
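If you don't know the edges in advance, one way to compute them (assuming the file is sorted by ID, as panel data usually is) is to read just the ID column with usecols, which is cheap relative to the full file, and record the rows where the ID changes. The +1 offsets account for the header line, so the result matches the skiprows convention above; the file name and ID values here are illustrative:

```python
import pandas as pd

# Stand-in file: nine rows, IDs 1..3 with three rows each.
pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3]}).to_csv('ids.csv', index=False)

ids = pd.read_csv('ids.csv', usecols=['ID'])['ID']

# Row positions where the ID differs from the previous row (includes row 0).
change_points = ids.ne(ids.shift()).to_numpy().nonzero()[0]

# Shift by 1 to account for the header line, and append the final edge.
indices_to_split = [0] + [int(i) + 1 for i in change_points[1:]] + [len(ids) + 1]
print(indices_to_split)  # [0, 4, 7, 10]
```

For a 50 GB file even the single ID column may be large, but it can itself be read with chunksize and the change points accumulated chunk by chunk.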