Perl - Compare two large txt files and return the required lines from the first

Time:12-01

I am quite new to Perl programming. I have two text files, combined_gff.txt and pegs.txt. I would like to check whether each line of pegs.txt is a substring of any line in combined_gff.txt, and write only those matching lines from combined_gff.txt to a separate text file called output.txt.

However, my code produces no output. Any help, please?

#!/usr/bin/perl -w
use strict;

open (FILE, "<combined_gff.txt") or die "error";
my @gff = <FILE>;
close FILE;

open (DATA, "<pegs.txt") or die "error";
my @ext = <DATA>;
close DATA;

my $str = ''; #final string

foreach my $gffline (@gff) {
    foreach my $extline (@ext) {
        if ( index($gffline, $extline) != -1) {
            
            $str=$str.$gffline;
            $str=$str."\n";
            exit;
        }
    }
}

open (OUT, ">", "output.txt");
print OUT $str;
close (OUT);


CodePudding user response:

The first problem is exit. It terminates the whole program as soon as the first substring is found, so the code that opens and writes output.txt is never reached.

The second problem is the missing chomp: you don't remove the newlines from the lines, so a substring can only be found when a string from pegs.txt is a suffix of a string from combined_gff.txt.
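
To illustrate the newline problem, here is a minimal sketch. The two strings are made-up sample data, not taken from the real files; they only assume that each line is read with its trailing newline still attached.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for one line of each file.
my $gffline = "chr1\tpeg.12\tgene\n";   # as read from combined_gff.txt
my $extline = "peg.12\n";               # as read from pegs.txt

# The needle still ends in "\n", but "peg.12" is followed by a tab in the
# haystack, so index reports no match.
my $before = index($gffline, $extline);

chomp $extline;                         # strip the trailing newline
my $after = index($gffline, $extline);

print "before chomp: $before\n";        # prints -1
print "after chomp:  $after\n";         # prints 5
```

After chomp the substring is found at offset 5, right after "chr1" and the tab.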

Even after fixing these two problems, the algorithm will be very slow, because it compares every line of one file with every line of the other. It will also output a line multiple times if that line contains several different substrings (not sure if that's what you want).
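
If you want to keep the nested loops but avoid the duplicates, one possible sketch is to leave the inner loop at the first hit with last. The helper name matching_lines and the sample data are hypothetical, and the approach is still quadratic, so it only addresses the duplicate output, not the speed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the lines that contain at least one needle.
# Both arguments are array references holding already chomped strings.
sub matching_lines {
    my ($lines, $needles) = @_;
    my @found;
    for my $line (@$lines) {
        for my $needle (@$needles) {
            if (index($line, $needle) != -1) {
                push @found, $line;
                last;    # first hit is enough, so the line is kept only once
            }
        }
    }
    return @found;
}

# Made-up sample data; the first line contains both substrings.
my @lines   = ("a peg.1 and peg.2", "nothing here", "only peg.2");
my @needles = ("peg.1", "peg.2");

my @found = matching_lines(\@lines, \@needles);
print "$_\n" for @found;
```

The first sample line is printed once even though both substrings occur in it.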

Here's a different approach: first, read all the lines from pegs.txt and assemble them into a single regex (quotemeta is needed so that special characters in the substrings are interpreted literally in the regex). Then read combined_gff.txt line by line and print each line that the regex matches.

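#!/usr/bin/perl
use warnings;
use strict;

# Read the substrings to search for, one per line, without trailing newlines.
open my $data, '<', 'pegs.txt' or die $!;
chomp( my @ext = <$data> );
close $data;

# Build one alternation; quotemeta makes each substring match literally.
my $regex = join '|', map quotemeta, @ext;

# Stream combined_gff.txt and print each matching line once.
open my $file, '<', 'combined_gff.txt' or die $!;
open my $out,  '>', 'output.txt' or die $!;
while (<$file>) {
    print {$out} $_ if /$regex/;
}
close $out;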
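
To see why quotemeta matters, here is a small sketch with made-up substrings. Without escaping, the dot in 'peg.1' matches any character, so an unrelated line can match by accident.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @ext = ('peg.1', 'a+b');                   # hypothetical substrings

my $unsafe = join '|', @ext;                  # metacharacters stay active
my $safe   = join '|', map quotemeta, @ext;   # metacharacters are escaped

my $line = 'pegX1';                           # does not contain "peg.1" literally

my $unsafe_hit = $line =~ /$unsafe/ ? 1 : 0;  # 1: the "." matched "X"
my $safe_hit   = $line =~ /$safe/   ? 1 : 0;  # 0: only a literal "peg.1" would match

print "escaped regex: $safe\n";               # prints peg\.1|a\+b
print "unsafe: $unsafe_hit, safe: $safe_hit\n";
```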

I also switched to the three-argument form of open with lexical filehandles, as it is the canonical way: the three-argument form is safe even for files named >file or rm *|, and lexical filehandles aren't global and are easier to pass as arguments to subroutines. Also, showing the actual error ($!) is more helpful than just dying with "error".
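
As a small illustration of passing a lexical filehandle to a subroutine, here is a sketch. The helper read_lines is hypothetical, and the file is opened in memory (by giving open a reference to a scalar) only so that the example runs without pegs.txt on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes any readable filehandle, returns chomped lines.
sub read_lines {
    my ($fh) = @_;
    chomp( my @lines = <$fh> );
    return @lines;
}

# In-memory stand-in for pegs.txt; with a real file, pass the file name
# as the third argument instead of the scalar reference.
my $content = "peg.1\npeg.2\n";
open my $fh, '<', \$content or die "Cannot open in-memory file: $!";
my @ext = read_lines($fh);
close $fh;

print scalar(@ext), " substrings read\n";   # prints: 2 substrings read
```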

CodePudding user response:

As choroba says, you don't need the "exit" inside the loop, since it ends the execution of the whole script, and you must remove the line feeds (LF) by chomping the lines in order to find the matches.

Following the logic of your script, I made a version with those corrections, and it worked fine.

#!/usr/bin/perl -w
use strict;

open (FILE, "<combined_gff.txt") or die "error";
my @gff = <FILE>;
close FILE;

open (DATA, "<pegs.txt") or die "error";
my @ext = <DATA>;
close DATA;

my $str = ''; #final string

foreach my $gffline (@gff) {
    chomp($gffline);
    foreach my $extline (@ext) {
        chomp($extline);    # no-op after the first pass over @ext
        # index returns -1 when $extline does not occur in $gffline
        if ( index($gffline, $extline) > -1) {
            $str .= $gffline ."\n";
        }
    }
}

open (OUT, ">", "output.txt") or die "error";
print OUT $str;
close (OUT);

Hope it works for you.

Welcho
