Perl performance is slow, file I/O issue or due to while loop


I have the following code in my while loop, and it is significantly slow. Any suggestions on how to improve this?

open IN, "<$FileDir/$file" || Err( "Failed to open $file at location: $FileDir" );
my $linenum = 0;

while ( $line = <IN> ) {
    if ( $linenum == 0 ) {
        Log(" This is header line : $line");
        $linenum++;
    } else {
        $linenum++;
        my $csv    = Text::CSV_XS->new();
        my $status = $csv->parse($line);
        my @val    = $csv->fields();

        $index = 0;
        Log("number of parameters for this file is: $sth->{NUM_OF_PARAMS}");
        for ( $index = 0; $index <= $#val; $index++ ) {
            if ( $index < $sth->{NUM_OF_PARAMS} ) {
                $sth->bind_param( $index + 1, $val[$index] );
            }
        }

        if ( $sth->execute() ) {
            $ifa_dbh->commit();
        } else {
            Log("line $linenum insert failed");
            $ifa_dbh->rollback();
            exit(1);
        }
    }
}

CodePudding user response:

By far the most expensive operation there is accessing the database server; it's a network trip, hundreds of milliseconds or some such, each time.

Are those DB operations inserts, as they appear? If so, instead of inserting row by row, construct a string for one insert statement with multiple rows (in principle as many as there are) in that loop. Then run that one transaction.

Test, and scale down if that adds up to too many rows: keep adding rows to the string for the insert statement up to a chosen maximum number, run that insert, then start a new statement and keep going.
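
For instance, a rough sketch of that capping idea, borrowing the getline loop and $dbh->quote from the full example further below ($MAX_ROWS, $table, and @fields are placeholders to adapt to your setup):

# Sketch only: flush a multi-row INSERT every $MAX_ROWS rows instead of building
# one enormous statement. Assumes a connected $dbh, a Text::CSV_XS object in $csv,
# an open $IN filehandle, and $table/@fields defined elsewhere.
my $MAX_ROWS = 1_000;    # tune to taste
my $qry_head = "INSERT into $table (" . join(',', @fields) . ") VALUES ";

my @sql_values;
while ( my $row = $csv->getline($IN) ) {
    push @sql_values, '(' . join(',', map { $dbh->quote($_) } @$row[0..$#fields]) . ')';

    if ( @sql_values >= $MAX_ROWS ) {    # full batch: insert it and start over
        $dbh->do( $qry_head . join(', ', @sql_values) ) or die $DBI::errstr;
        @sql_values = ();
    }
}
if (@sql_values) {                       # leftover partial batch
    $dbh->do( $qry_head . join(', ', @sql_values) ) or die $DBI::errstr;
}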

A few more readily seen inefficiencies

  • Don't construct an object every time through the loop. Build it once before the loop, then use (and repopulate) it as needed inside the loop. There's also no need for parse and fields here; getline replaces them and is a bit faster

  • There's no need for that if statement on every read. First read one line of data, and that's your header; then enter the loop, with no ifs needed

Altogether, without placeholders, which may not be needed now, something like this:

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

# There's a $table earlier, with its @fields to populate
my $qry = "INSERT into $table (" . join(',', @fields) . ") VALUES ";

open my $IN, '<', "$FileDir/$file" 
    or Err( "Failed to open $file at location: $FileDir" );

my $header_arrayref = $csv->getline($IN);
Log( "This is header line : @$header_arrayref" );

my @sql_values;
while ( my $row = $csv->getline($IN) ) {       
    # Use as many elements in the row (@$row) as there are @fields
    push @sql_values, '(' . 
        join(',', map { $dbh->quote($_) } @$row[0..$#fields]) . ')';

    # May want to do more to sanitize input further
}

$qry .= join ', ', @sql_values;

# Now $qry is ready. It is
# INSERT into table_name (f1,f2,...) VALUES (v11,v12...), (v21,v22...),...
$dbh->do($qry) or die $DBI::errstr;

I've also corrected the error handling when opening the file: the || in the question binds too tightly, so it effectively parses as open IN, ( "<$FileDir/$file" || Err(...) ). We need or instead of || there. The three-argument open is also better. See perlopentut.
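
To see the precedence issue in isolation:

# || binds tighter than the comma, so the original line parses as
#   open IN, ( "<$FileDir/$file" || Err(...) );
# The non-empty filename string is always true, so Err() never runs and
# a failed open goes unnoticed.
open IN, "<$FileDir/$file" || Err("Failed to open $file at location: $FileDir");

# The low-precedence 'or' applies to the whole open() call, as intended,
# and the three-argument form keeps the mode separate from the filename:
open my $IN, '<', "$FileDir/$file"
    or Err("Failed to open $file at location: $FileDir");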

If you do need the placeholders (perhaps because it can't all go in a single insert and must be broken into many, or for security reasons), then you need to generate the exact ?-tuples for each row to be inserted, and later supply the right number of values for them.

We can assemble the data first and then build the ?-tuples based on it:

my $qry = "INSERT into $table (" . join(',', @fields) . ") VALUES ";

...

my @data;
while ( my $row = $csv->getline($IN) ) {    
    push @data, [ @$row[0..$#fields] ];
}

# Append the right number of (?,?...),... with the right number of ? in each
$qry .=  join ', ', map { '(' . join(',', ('?')x@$_) . ')' } @data;

# Now $qry is ready to bind and execute
# INSERT into table_name (f1,f2,...) VALUES (?,?,...), (?,?,...), ...
$dbh->do($qry, undef, map { @$_ } @data) or die $DBI::errstr;

This may generate a very large string, which may push the limits of your RDBMS or some other resource. In that case, break @data into smaller batches: prepare the statement with the right number of (?,?,...) row-values for a batch, and execute it in a loop over the batches.
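
For example, a minimal sketch of that batching, with the same assumptions as above ($dbh, $table, @fields, and @data already built); $BATCH is an arbitrary size to tune:

# Prepare one statement for a full batch of $BATCH rows and reuse it, then
# handle any leftover rows with a smaller, one-off statement.
my $BATCH = 1_000;

my $row_tuple = '(' . join(',', ('?') x @fields) . ')';

my $full_sql = "INSERT into $table (" . join(',', @fields) . ") VALUES "
             . join(', ', ($row_tuple) x $BATCH);
my $sth_full = $dbh->prepare($full_sql);

while ( @data >= $BATCH ) {
    my @batch = splice @data, 0, $BATCH;
    $sth_full->execute( map { @$_ } @batch ) or die $DBI::errstr;
}

if (@data) {    # leftover rows, fewer than $BATCH of them
    my $sql = "INSERT into $table (" . join(',', @fields) . ") VALUES "
            . join(', ', ($row_tuple) x @data);
    $dbh->do( $sql, undef, map { @$_ } @data ) or die $DBI::errstr;
}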

Finally, another way altogether is to load data directly from a file using the database's own tool for that particular purpose. This will be far faster than going through DBI, probably even counting the time needed to process your input CSV into another file with only the needed data.

Since you don't need all data from your input CSV file, first read and process the file as above and write out a file with only the needed data (@data above). Then there are two possible ways

  • Either use an SQL command for this – COPY in PostgreSQL, LOAD DATA [LOCAL] INFILE in MySQL and Oracle (etc); or,

  • Use your RDBMS's dedicated tool for importing/loading files – mysqlimport (MySQL), SQL*Loader/sqlldr (Oracle), etc. I'd expect this to be the fastest way

The second of these options can also be done from within a program, by running the appropriate tool as an external command via system (or, better yet, via the suitable libraries).
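
For instance, a rough sketch: write out only the needed data with Text::CSV_XS, then hand the file to the loader via system. The file name, database name, and mysqlimport options here are illustrative and will need adjusting for your setup; for Oracle you'd build a control file and call sqlldr instead.

use Text::CSV_XS;

# mysqlimport derives the table name from the file's basename ("load_me" here),
# so name the file after the target table; $db_name is a placeholder.
my $outfile = "$FileDir/load_me.csv";
my $csv_out = Text::CSV_XS->new({ binary => 1, eol => "\n" });

open my $OUT, '>', $outfile or die "Can't open $outfile: $!";
$csv_out->print($OUT, $_) for @data;     # @data as assembled above
close $OUT or die "Can't close $outfile: $!";

# Run the loader as an external command and check its exit status
# (credentials and further options omitted).
system('mysqlimport', '--local', '--fields-terminated-by=,', $db_name, $outfile) == 0
    or die "mysqlimport failed: $?";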


In one application I've put together as many as millions of rows in the initial insert -- the string itself for that statement was in the high tens of MB -- and that keeps running with ~100k rows inserted in a single statement daily, for a few years now. This is PostgreSQL on good servers, and of course YMMV.

Some RDBMSs do not support a multi-row (batch) insert query like the one used here; in particular, Oracle seems not to. (We were informed in the end that that's the database used here.) But there are other ways to do it in Oracle; please see the links in the comments, and search for more. The script will then need to construct a different query, but the principle of operation is the same.
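
One more option in that direction is DBI's execute_array: prepare a single-row insert and bind whole columns of values at once; drivers like DBD::Oracle can turn that into native array binding, though it's worth verifying the behavior for your driver and version. A minimal sketch, reusing $dbh, $table, @fields, and @data from above:

# Single-row INSERT, executed for all rows at once via DBI's execute_array.
my $sql = "INSERT into $table (" . join(',', @fields) . ") VALUES ("
        . join(',', ('?') x @fields) . ")";
my $sth = $dbh->prepare($sql);

# execute_array expects one array reference per placeholder (column-wise),
# so transpose the row-wise @data into per-column arrays first.
my @columns = map { my $i = $_; [ map { $_->[$i] } @data ] } 0 .. $#fields;

my @tuple_status;
$sth->execute_array({ ArrayTupleStatus => \@tuple_status }, @columns)
    or die "Bulk insert failed: $DBI::errstr";
$dbh->commit;    # assuming AutoCommit is off, as in the original code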
