Home > OS >  Split large DataFrame into Dataframes containing records of unique values in a column
Split large DataFrame into Dataframes containing records of unique values in a column

Time:06-29

A csv file has 90 million rows. One of the Columns in named "State". It has 12 unique values at present. (The count of unique values in the "State" column is dynamic and can change with each csv file.)

I want to split the DataFrame into smaller chunks and then save State-wise files. The code below is not working.

source_path = "DataJune.txt"
for i,chunk in enumerate(pd.read_csv(source_path, sep = '|',chunksize=1000000)):
    dfs = dict(tuple(chunk.groupby('State')))
    for i, df in dfs.items():
        df = df.append(df)
        df.to_csv("tempcsv/"   i  ".csv",sep=",", index = False)

CodePudding user response:

You can use:

import pandas as pd
import os

source_path = 'DataJune.txt'
fps = {}

for chunk in pd.read_csv(source_path, sep='|', chunksize=1000000, dtype=object):
    for state, df in chunk.groupby('State'):
        # New state, create a new file and write headers
        if state not in fps:
            fps[state] = open(f'tempcsv/{state}.csv', 'w')
            fps[state].write(f"{','.join(df.columns)}{os.linesep}")

        # Write data without headers
        df.to_csv(fps[state], index=False, header=False)

# Close files properly
for fp in fps.values():
    fp.close()
del fps

CodePudding user response:

IIUC, Try:

source_path = "DataJune.txt"

from collections import defaultdict

def def_value():
    return pd.DataFrame()
      
# Defining the dict
d = defaultdict(def_value)

for i,chunk in enumerate(pd.read_csv(source_path, sep = '|',chunksize=2)):
    chunk_states = chunk['State'].unique()
    for state in chunk_states:
        d[state]=d[state].append(chunk[chunk['State']==state])
for i, df in d.items():
    df.to_csv("tempcsv/"   str(i)  ".csv",sep=",", index = False)

Another version, based on the @Corralien comment:

source_path = "DataJune.txt"

for i,chunk in enumerate(pd.read_csv(source_path, sep = '|',chunksize=2)):
    chunk_states = chunk['State'].unique()
    
    for state in chunk_states:
        with open("tempcsv/"   str(state)  ".csv",mode='a ') as file:
            for i, row in chunk[chunk['State']==state].iterrows():
                file.write(','.join([str(x) for x in row]))
                file.write('\n')
  • Related