Is this the right way to split a big dataframe into smaller ones with certain count of rows?


This is my code below; each smaller df should contain 3000 rows, for example.

import pandas as pd

def partition_df(df):
    partitioned_df = []
    smaller_df = []
    for index, row in df.iterrows():
        smaller_df.append(row)
        # flush a chunk whenever the index reaches a multiple of 3000
        # (skipping index 0); assumes df has a default RangeIndex (0, 1, 2, ...)
        if (index % 3000) == 0 and index != 0:
            partitioned_df.append(pd.DataFrame(smaller_df))
            smaller_df.clear()

    # keep any leftover rows as a final, smaller chunk
    if smaller_df:
        partitioned_df.append(pd.DataFrame(smaller_df))
    return partitioned_df
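
For reference, here is a quick sanity check of the chunk sizes it produces, using a hypothetical 10,000-row frame:

df = pd.DataFrame({'a': range(10000)})
parts = partition_df(df)
print([len(p) for p in parts])  # [3001, 3000, 3000, 999]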

Is there any issue with this split?

CodePudding user response:

If you need to split the data and save each piece, you can make use of the chunksize argument when loading the dataframe: pd.read_csv then yields one chunk at a time instead of the whole file.

import pandas as pd

chunk_size = 3000

# read the file in 3000-row pieces and write each piece to its own CSV
for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=chunk_size)):
    chunk.to_csv(f'chunk_{idx}.csv', index=False)
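
If you need the pieces in memory rather than on disk, the same reader can simply be collected into a list; a minimal sketch, assuming the same file.csv:

import pandas as pd

# each element is a DataFrame of up to 3000 rows
partitioned_df = list(pd.read_csv('file.csv', chunksize=3000))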

CodePudding user response:

Actually there is a function np.array_split which already does this. Pass the row positions at which to cut, i.e. every 3000 rows:

import numpy as np

partitioned_df = np.array_split(df, range(3000, len(df), 3000))

No conversion step is needed: np.array_split already returns a plain Python list of DataFrames.
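
As a quick check of the chunk sizes, again with a hypothetical 10,000-row frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(10000)})
parts = np.array_split(df, range(3000, len(df), 3000))
print([len(p) for p in parts])  # [3000, 3000, 3000, 1000]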