This is my code below. Each smaller DataFrame should contain 3000 rows.
def partition_df(df):
    partitioned_df = []
    smaller_df = []
    for index, row in df.iterrows():
        smaller_df.append(row)
        if (index % 3000) == 0 and index != 0:
            partitioned_df.append(pd.DataFrame(smaller_df))
            smaller_df.clear()
    if smaller_df:
        partitioned_df.append(pd.DataFrame(smaller_df))
    return partitioned_df
Is there any issue with this split?
CodePudding user response:
If you need to split the data and save the pieces, you can make use of the chunksize
argument when loading the file:
chunk_size = 3000
for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=chunk_size)):
    chunk.to_csv(f'chunk_{idx}.csv', index=False)
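The same chunking also works entirely in memory instead of writing files. A self-contained sketch, using a small fabricated CSV and a small chunk_size so the splits are visible (both are placeholders, not the asker's real data):

```python
import io

import pandas as pd

# Build a tiny CSV in memory so the example is self-contained.
csv_data = io.StringIO(
    "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))
)

chunk_size = 4  # use 3000 for the real data
# read_csv with chunksize yields DataFrames of at most chunk_size rows.
chunks = [chunk for chunk in pd.read_csv(csv_data, chunksize=chunk_size)]

print([len(c) for c in chunks])  # [4, 4, 2]
```

Each element of `chunks` is an ordinary DataFrame, so you can process or save each one however you like.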
CodePudding user response:
Actually there is a function np.array_split
which already does this. Note that np.split(df, [3000]) would only cut the frame once,
at row 3000, into exactly two pieces; to get chunks of 3000 rows each, pass every break point:
partitioned_df = np.array_split(df, range(3000, len(df), 3000))
The result is already a list of DataFrames, so no .tolist() call is needed.
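A minimal sketch of this approach with a toy DataFrame and a chunk size of 4 so the splits are visible (both values are stand-ins for the real 3000-row case; recent pandas versions may emit a deprecation warning here because NumPy slices the frame via DataFrame.swapaxes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})

chunk_size = 4  # use 3000 for the real data
# Break points 4 and 8 split the frame into rows 0-3, 4-7, and 8-9.
parts = np.array_split(df, range(chunk_size, len(df), chunk_size))

print([len(p) for p in parts])  # [4, 4, 2]
```

This avoids the row-by-row iterrows loop entirely, and the last piece simply holds whatever rows remain.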