I have a dataframe like below:

id  B   C
1   2   3
1   3   4
2   4   2
3   12  32

I want to store each group of rows in its own CSV file (1.csv, 2.csv, 3.csv), where each file contains all the rows for that id. Can I do this efficiently? I know it can be done with a for loop, but that is time consuming.
CodePudding user response:
According to the Pandas docs, the DataFrame method for writing content to a CSV file is to_csv. It looks like there is no specific parameter to optimize it for this use case, as you can see here.
You can solve this problem in an O(n) operation, assuming the IDs are ordered. You already have the entire DataFrame in memory, so by saving pieces to individual files you can also free memory, dropping each slice after it is written.
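As a minimal sketch of that idea (assuming the output files go in the current directory), a single groupby over the id column visits each row once and writes each group exactly once:

```python
import pandas as pd

# Minimal reproducible example matching the question's data
df = pd.DataFrame({'id': [1, 1, 2, 3],
                   'B': [2, 3, 4, 12],
                   'C': [3, 4, 2, 32]})

# One pass over the data: each group is written to its own file.
for name, group in df.groupby('id'):
    group.to_csv(f'{name}.csv', index=False)
```

This produces 1.csv (two rows), 2.csv, and 3.csv. The loop itself is cheap; the bulk of the time is the I/O in to_csv, which is what the multiprocessing answer below parallelizes.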
CodePudding user response:
As suggested by @Lazyer, you can use multiprocessing:
import multiprocessing as mp
import os
import time

import numpy as np
import pandas as pd

def to_csv(name, df):
    df.to_csv(f'export/{name}.csv', index=False)

if __name__ == '__main__':  # Do not remove this line! Mandatory for multiprocessing
    # Set up a minimal reproducible example
    N = 10_000_000
    rng = np.random.default_rng(2022)
    df = pd.DataFrame(rng.integers(1, 10000, (N, 3)),
                      columns=['id', 'B', 'C'])
    os.makedirs('export', exist_ok=True)

    # Multiprocessing: one worker per CPU, one (name, group) pair per task
    start = time.time()
    with mp.Pool(mp.cpu_count()) as pool:
        pool.starmap(to_csv, df.groupby('id'))
    end = time.time()
    print(f"[MP] Elapsed time: {end - start:.2f} seconds")

    # Single processing: sequential write of each group
    start = time.time()
    for name, subdf in df.groupby('id'):
        subdf.to_csv(f'export/{name}.csv', index=False)
    end = time.time()
    print(f"[SP] Elapsed time: {end - start:.2f} seconds")
Test for 10,000,000 records:
[...]$ python mp.py
[MP] Elapsed time: 2.99 seconds
[SP] Elapsed time: 12.97 seconds