Split large csv file into multiple files based on column(s)


I would like to know of a fast/efficient way in any program (awk/perl/python) to split a csv file (say 10k columns) into multiple small files each containing 2 columns. I would be doing this on a unix machine.

#contents of large_file.csv
1,2,3,4,5,6,7,8
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z

I now want multiple files like this:

# contents of 1.csv
1,2
a,b
q,w
a,s
z,x

# contents of 2.csv
1,3
a,c
q,e
a,d
z,c

# contents of 3.csv
1,4
a,d
q,r
a,f
z,v

and so on...

I can do this currently with awk on small files (say 30 columns) like this:

awk -F, 'BEGIN{OFS=",";} {for (i=1; i < NF; i++) print $1, $(i+1) > i ".csv"}' large_file.csv

The above takes a very long time with large files and I was wondering if there is a faster and more efficient way of doing the same.

Thanks in advance.

CodePudding user response:

With your shown samples and attempts, please try the following awk code. Since your version opens all the output files at once, it may fail with the infamous "too many open files" error. To avoid that, this code collects all the values into an array and prints them one by one in the END block, closing each output file as soon as its contents have been written.

awk '
BEGIN{ FS=OFS="," }
{
  # buffer each output file's contents in memory, keyed by column-pair index
  for(i=1;i<NF;i++){
    value[i]=(value[i]?value[i] ORS:"") ($1 OFS $(i+1))
  }
}
END{
  # write each buffer out in one go, closing the file as soon as it is written
  for(i=1;i<NF;i++){
    outFile=i".csv"
    print value[i] > (outFile)
    close(outFile)
  }
}
' large_file.csv
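
For contrast, here is a minimal sketch of the other way to dodge the open-files limit, without buffering anything in memory: close each output file immediately after every single write. This is not part of the answer above and will be slow because of the constant open/close calls, but it keeps both memory use and the number of open descriptors tiny (remove any old *.csv files first, since it appends):

awk '
BEGIN{ FS=OFS="," }
{
  for(i=1;i<NF;i++){
    out = i ".csv"
    # append, because the file is reopened on every write;
    # a plain ">" would truncate it again after each close()
    print $1, $(i+1) >> out
    close(out)
  }
}
' large_file.csv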

CodePudding user response:

The main hold-up here is writing to so many files.

Here is one way:

use warnings;
use strict;
use feature 'say';
    
my $file = shift // die "Usage: $0 csv-file\n";

my @lines = do { local @ARGV = $file; <> };
chomp @lines;

# one output filehandle per column pair (f1.csv, f2.csv, ...)
my @fhs = map {
    open my $fh, '>', "f${_}.csv" or die $!;
    $fh
} 1 .. scalar( split /,/, $lines[0] ) - 1;

for (@lines) { 
    my ($first, @cols) = split /,/; 
    say {$fhs[$_]} join(',', $first, $cols[$_]) 
        for 0..$#cols;
}

I didn't time this against any other approaches. Assembling data for each file first and then dumping it in one operation into each file may help, but first let us know how large the original CSV file is.

Opening so many output files at once (for the @fhs filehandles) may pose problems. If that is the case, then the simplest way is to first assemble all the data and then open and write one file at a time:

use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 csv-file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my @data;
while (<$fh>) {
    chomp;
    my ($first, @cols) = split /,/;
    push @{$data[$_]}, join(',', $first, $cols[$_]) 
        for 0..$#cols;
}

for my $i (0..$#data) {
    open my $fh, '>', $i+1 . '.csv' or die $!;
    say $fh $_ for @{$data[$i]};
}

This depends on whether the entire original CSV file, plus a bit more, can be held in memory.
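
If the whole file does not fit in memory, one alternative sketch (an assumption on my part, not part of either answer) is to trade memory for extra passes over the input: run awk several times, each pass handling a bounded batch of output columns, so that neither memory use nor the number of simultaneously open files grows with the column count. The batch size of 500 and the small shell driver are illustrative only:

# count the columns once, then process 500 output columns per pass
ncols=$(head -n 1 large_file.csv | awk -F, '{ print NF }')
for start in $(seq 2 500 "$ncols"); do
  awk -F, -v OFS=',' -v s="$start" -v e="$((start + 499))" '
    {
      for (i = s; i <= e && i <= NF; i++) {
        out = (i - 1) ".csv"   # column 2 pairs into 1.csv, column 3 into 2.csv, ...
        print $1, $i > out
      }
    }
  ' large_file.csv
done

Each pass re-reads large_file.csv, so for 10k columns this costs roughly twenty scans of the input, but it never holds more than one batch of output files open or more than one line in memory.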
