How to shuffle and split a large csv with headers?-CodePudding

I am trying to find a way to shuffle the lines of a large csv files in Python and then split it into multiple csv files (assigning a number of rows for each files) but I can't manage to find a way to shuffle the large dataset, and keep the headers in each csv. It would help a lot if someone would know how to

Here's the code I found useful for splitting a csv file:

number_of_rows = 100

def write_splitted_csvs(part, lines):
    with open('mycsvhere.csv'  str(part)  '.csv', 'w') as f_out:
        f_out.write(header)
        f_out.writelines(lines)

with open("mycsvhere.csv", "r") as f:
    count = 0
    header = f.readline()
    lines = []
    for line in f:
        count  = 1
        lines.append(line)
        if count % number_of_rows == 0:
            write_splitted_csvs(count // number_of_rows, lines)
            lines = []
    
    if len(lines) > 0:
        write_splitted_csvs((count // number_of_rows)   1, lines)

If anyone knows how to shuffle all these splitted csv this would help a lot! Thank you very much

CodePudding user response：

I would suggest using Pandas if possible.

Shuffling rows, reset the index in place:

import pandas as pd
df = pd.read_csv('mycsvhere.csv'  str(part)  '.csv')
df.sample(frac=1).reset_index(drop=True)

Then you can split into multiple dataframes into a list:

number_of_rows = 100
sub_dfs = [df[i:i   number_of_rows] for i in range(0, df.shape[0], number_of_rows)]

Then if you want to save the csvs locally:

for idx, sub_df in enumerate(sub_dfs):
    sub_df.to_csv(f'csv_{idx}.csv', index=False)

CodePudding user response：

There are 3 needs here :

Shuffle your dataset
Split your dataset
Formatting

For the first 2 steps, there are some nice tools in Sklearn. You can try the stratified shuffle splitter. Sklearn SSS You did not mention Stratified part, but you may need it without knowing it yet ;)

Last part, formatting, it is all up to you. You can check pandas to_csv() function where you can specify your headers, you can(need) specify your headers in the data object aswell (DataFrame). Nothing hard here, just spend a bit of time to specify what you want, and implement it easily :)

Side comments : You can drop pandas, depending on what 'big' is for you, pandas is not 'good' on big data.