Fastest way to replace characters in a very large data file


I have a very large file (19M rows) and the data needs to be cleaned. I am using a Windows 11 machine, and the data is being loaded into SQL Server 19. I am currently using a Perl script to remove any commas that are between double quotes; I will post the script below. It is taking a very long time to run. I feel like PowerShell would be quicker, but I cannot seem to get it to run the regex that I need.

#! /usr/bin/perl -w
use strict;
use warnings;
 

my $SRC = $ARGV[0];
my $DES = $ARGV[1];
open(SRC,'<',$SRC) or die $!;
open(DES,'>',$DES) or die $!;

my $n_fields = "";
while (<>) {
    s/\s+$//;
    if (/^\#/) { # header line
        my @t = split(/,/);
        $n_fields = scalar(@t); # total number of fields
    } else { # actual data
        my $n_commas = $_ =~ s/,/,/g; # total number of commas
        foreach my $i (0 .. $n_commas - $n_fields) { # iterate ($n_commas - $n_fields + 1) times
            s/(\"[^",]+),([^"]+\")/$1\"$2/g; # single replacement per previous answers
        }
        }
        s/\"//g; # removal of double quotes (if you want)
    }
    print DES "$_\n";
}

close (DES);
close (SRC);

The code I provided above works, but it is very, very slow. I am looking for a quicker way to process the data. Thank you in advance.

CodePudding user response:

I take the task to be: remove commas inside double quotes, and remove the quotes as well. (So clean up the extra commas and strip the quotes from a CSV file; a line like a,"b, c",d would become a,b c,d.)

For starters, here is some code that does that. Please provide a sample input (and/or time this code) so it can be improved.

Using a library (like Text::CSV, see here for an example) is always recommended, but it may not be faster in this case, so here is basic regex-based code for now. (A realistic sample is needed to benchmark it against a CSV library.) Please note the assumptions it makes.

use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 file\n";

open my $fh, $file or die "Can't open $file: $!";

my $header = <$fh>;

if ($header !~ /^#/) {  # not actually a header but data! Process
    print $header =~ s{"[^"]+"}{ $& =~ tr/,"//dr }ger;  #"
}

while (<$fh>) {
    print s{"[^"]+"}{ $& =~ tr/,"//dr }ger;  #" stop bad syntax hilite
}
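
This reads the file named on the command line and writes the cleaned lines to standard output, so (with the script saved as, say, clean.pl, a name chosen here just for illustration) it can be redirected into a new file:

perl clean.pl input.csv > output.csv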

While waiting for a realistic sample, here is a benchmark.

I make an input file by repeating the following two lines 1,000 times and shuffling them:

have,"text, and, commas","and more, commas",end
so,"no stuff with commas","yes remove, commas",done

Then run on it a program like this:

use warnings;
use strict;
use feature 'say';

#use Test::More qw(no_plan);  # uncomment to run a test below
use Text::CSV; 
use Benchmark qw(cmpthese);

my $file = shift // die "Usage: $0 file\n";

my $runfor = shift // 3;

sub by_regex {
    my ($fh) = @_;    
    my @lines;

    my $header = <$fh>;
    if ($header !~ /^#/) { # not actually a header but data!
        push @lines, $header =~ s{"([^"]+)"}{ $1 =~ tr/,//dr }ger; #"
    }

    while (<$fh>) {
        push @lines, s{"([^"]+)"}{ $1 =~ tr/,//dr }ger;  #"
    }

    seek $fh, 0, 0;  # return filehandle to beginning for the next run

    chomp @lines;
    return \@lines;
}

sub by_lib {
    my ($csv, $fh) = @_;
    my @lines;

    my $row = $csv->getline($fh);          # first row (may be a header)
    if ($row->[0] !~ /^#/) { # not headers! data
        tr/,//d for @$row;
        push @lines, join ',', @$row;
    }

    while (my $row = $csv->getline($fh)) {
        tr/,//d for @$row;
        push @lines, join ',', @$row;
    }

    seek $fh, 0, 0;
    return \@lines;
}

my $csv = Text::CSV->new( { binary => 1, allow_whitespace => 1 } ) 
    or die "Cannot use CSV: " . Text::CSV->error_diag (); 

open my $fh, $file or die $!;

# Check, at least once for every change in code
#is_deeply( by_regex($fh),  by_lib($csv, $fh), 'same' ); 

cmpthese( -$runfor, {
    by_regex => sub { by_regex($fh) },
    by_lib   => sub { by_lib($csv, $fh) },
});

If your input plays loose with quotes, having fields like some "then quoted" more instead of the whole field being quoted, then use the allow_loose_quotes attribute.
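
That is only a change in the constructor, for example:

my $csv = Text::CSV->new( { binary => 1, allow_whitespace => 1, allow_loose_quotes => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag ();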

By running program.pl input.csv 30 (run each test for 30 seconds) I get, on an old laptop with v5.16:

          Rate   by_lib by_regex
by_lib   130/s       --     -14%
by_regex 152/s      17%       --

(Practically the same on a server with 5.36.0, with rates roughly twice as large.)

If this holds in your case, and the regex cannot be substantially improved, then I'd still strongly recommend using the code with the CSV library instead. The library's correctness and its ability to handle a range of unexpected/bad CSV properties are worth far more than a 20-ish percent speed improvement.

Then, instead of collecting lines in an array (push @lines, join ',', @$row;), which is done above for benchmarking purposes, print them as they come (say join ',', @$row;), at least for very large files.
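
For instance, the library-based processing then reduces to a plain streaming loop; a sketch reusing $csv and $fh from the benchmark above (and its use feature 'say'):

while (my $row = $csv->getline($fh)) {
    tr/,//d for @$row;       # drop commas inside each field
    say join ',', @$row;     # print right away instead of storing
}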

CodePudding user response:

It looks like you have created an endless loop by writing to a file while reading it.

my $SRC = $ARGV[0];     # files in your @ARGV will be read by <> diamond operator
my $DES = $ARGV[1];
....
while (<>) {            # should be <SRC>, but is <>
....
    print DES "$_\n";   # adding lines to your output file while reading it with <>

Change while (<>) to while (<SRC>) and see what that does. Also check the size of your output file: a huge file would confirm the endless loop.
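
A sketch of the corrected loop, keeping the filehandles the script already opens:

while (<SRC>) {             # read only from the opened input file
    # ... same substitutions as before ...
    print DES "$_\n";       # write only to the output handle
}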
