I have a large CSV file (~50 GB on disk), and am unable to read this entirely into memory at once. The dataset itself is panel data and looks something like
ID Time Col 1 ... Col N
1 2000/1/1 ...
1 2000/1/2
...
2 2000/1/1 ...
...
My idea is to read the data in chunks, do some pre-processing to reduce its size, and then save each chunk individually. I am aware of pd.read_csv(..., chunksize=1000),
which lets me loop over chunks of 1000 rows, but for the pre-processing to be accurate, I would prefer to loop over chunks defined by the ID column. (All the rows corresponding to a specific ID
are required for accurate pre-processing.)
In other words, let's say I have a smaller file that contains all the ID
values (say, 1-1000). Then, I want to do something like
list_of_id_chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], ...]  # Split the total IDs into chunks of 3 IDs each
for chunk_of_ids in list_of_id_chunks:
    # 1. Read the large CSV file, keeping only the rows where `ID` is in chunk_of_ids
    #    (for the first iteration, this should have rows with ID = 1, 2, or 3)
    # 2. Do some preprocessing to trim file size
    # 3. Save the result as csv, feather, etc.
Any suggestions?
CodePudding user response:
You could start with something like this, it reads your file 1 million lines at a time, breaks each chunk up by ID, and saves by ID to new files. In the end, you'd have a separate file for each ID.
import pandas as pd

with pd.read_csv('big_file.csv', chunksize=1_000_000) as reader:
    for chunk in reader:
        for name, group in chunk.groupby('ID'):
            # Append (mode='a'), since rows for one ID may span multiple chunks;
            # header=False means the per-ID files have no header row.
            group.to_csv(f'big_file_id_{name}.csv', mode='a', index=False, header=False)
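Because the per-ID files are written with header=False, pass header=None (with your own column names) when reading them back. A minimal sketch of the follow-up step, regrouping those files into chunks of IDs for preprocessing, might look like this (the file names, the tiny stand-in files, and the chunks helper are illustrative):

```python
import pandas as pd

def chunks(seq, size):
    """Yield successive slices of `seq` of length `size`."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Tiny stand-in for the per-ID files so the example runs
# (two rows per ID, no header, matching the splitter above).
for i in (1, 2, 3):
    pd.DataFrame({'ID': [i, i], 'value': [i * 10, i * 10 + 1]}).to_csv(
        f'big_file_id_{i}.csv', index=False, header=False)

cols = ['ID', 'value']  # your real column names go here
for chunk_of_ids in chunks([1, 2, 3], 3):
    df = pd.concat(
        [pd.read_csv(f'big_file_id_{i}.csv', header=None, names=cols)
         for i in chunk_of_ids],
        ignore_index=True)
    # ... preprocess df here, then save as csv, feather, etc. ...
    print(len(df))  # 6 rows: two per ID, three IDs in this chunk
```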
CodePudding user response:
If you know the row indices of the chunk edges (in the example below, the edges would be [0, 4, 7, 10]),
then you can make use of skiprows and nrows in pandas' read_csv to achieve this.
Here is a sample snippet - hope it helps.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df.to_csv("test.csv", index=None)

indices_to_split = [0, 4, 7, 10]  # these are the edges of your ids (counting the header line)

for idx, (start_index, end_index) in enumerate(
        zip(indices_to_split[:-1], indices_to_split[1:])):
    if idx == 0:
        # First chunk includes the header line, so read one row fewer
        df = pd.read_csv(
            "test.csv",
            nrows=end_index - start_index - 1,
            skiprows=start_index,
        )
        cols = df.columns
    else:
        # Later chunks have no header; reuse the column names from the first
        df = pd.read_csv(
            "test.csv",
            nrows=end_index - start_index,
            skiprows=start_index,
            header=None,
            names=cols,
        )
    print(df, end="\n" * 3)
a
0 1
1 2
2 3
a
0 4
1 5
2 6
a
0 7
1 8
2 9
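If you don't know the edges in advance, one way to compute them (assuming the file is sorted by ID, as panel data usually is) is to read just the ID column with usecols, which is cheap relative to the full file, and record the rows where the ID changes. The +1 offsets account for the header line, so the result matches the skiprows convention above; the file name and ID values here are illustrative:

```python
import pandas as pd

# Stand-in file: nine rows, IDs 1..3 with three rows each.
pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3]}).to_csv('ids.csv', index=False)

ids = pd.read_csv('ids.csv', usecols=['ID'])['ID']

# Row positions where the ID differs from the previous row (includes row 0).
change_points = ids.ne(ids.shift()).to_numpy().nonzero()[0]

# Shift by 1 to account for the header line, and append the final edge.
indices_to_split = [0] + [int(i) + 1 for i in change_points[1:]] + [len(ids) + 1]
print(indices_to_split)  # [0, 4, 7, 10]
```

For a 50 GB file even the single ID column may be large, but it can itself be read with chunksize and the change points accumulated chunk by chunk.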