This is my code below. Each smaller DataFrame should contain 3000 rows.
def partition_df(df):
    partitioned_df = []
    smaller_df = []
    for index, row in df.iterrows():
        smaller_df.append(row)
        if (index % 3000) == 0 and index != 0:
            partitioned_df.append(pd.DataFrame(smaller_df))
            smaller_df.clear()
    if smaller_df:
        partitioned_df.append(pd.DataFrame(smaller_df))
    return partitioned_df
Is there any issue with this split?
CodePudding user response:
If you need to split the data and save the pieces, you can make use of the chunksize
argument when loading the file:
chunk_size = 3000
for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=chunk_size)):
    chunk.to_csv(f'chunk_{idx}.csv', index=False)
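The same chunking also works entirely in memory instead of writing files. A self-contained sketch, using a small fabricated CSV and a small chunk_size so the splits are visible (both are placeholders, not the asker's real data):

```python
import io

import pandas as pd

# Build a tiny CSV in memory so the example is self-contained.
csv_data = io.StringIO(
    "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))
)

chunk_size = 4  # use 3000 for the real data
# read_csv with chunksize yields DataFrames of at most chunk_size rows.
chunks = [chunk for chunk in pd.read_csv(csv_data, chunksize=chunk_size)]

print([len(c) for c in chunks])  # [4, 4, 2]
```

Each element of `chunks` is an ordinary DataFrame, so you can process or save each one however you like.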
CodePudding user response:
Actually there is a function np.array_split
which already does this. Note that np.split(df, [3000]) would only cut the frame once,
at row 3000, into exactly two pieces; to get chunks of 3000 rows each, pass every break point:
partitioned_df = np.array_split(df, range(3000, len(df), 3000))
The result is already a list of DataFrames, so no .tolist() call is needed.
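A minimal sketch of this approach with a toy DataFrame and a chunk size of 4 so the splits are visible (both values are stand-ins for the real 3000-row case; recent pandas versions may emit a deprecation warning here because NumPy slices the frame via DataFrame.swapaxes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})

chunk_size = 4  # use 3000 for the real data
# Break points 4 and 8 split the frame into rows 0-3, 4-7, and 8-9.
parts = np.array_split(df, range(chunk_size, len(df), chunk_size))

print([len(p) for p in parts])  # [4, 4, 2]
```

This avoids the row-by-row iterrows loop entirely, and the last piece simply holds whatever rows remain.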