Having problems getting specific column in file to be printed to output file in a hash

I am working on an assignment and I am having some issues getting portions of a certain file to be printed to the output file in a hash. I was given a large file containing a list of different species (along with some other variables that aren't too important) and I am getting stuck on how to isolate that specific column and put it into a hash that can be printed to the output while counting how many times each species is mentioned.

#!/usr/bin/perl
use strict; use warnings;

open IN, $ARGV[0];                      ## open input file given as argument 1
open OUT, ">", $ARGV[1];                ## open output file given as argument 2
my @cols;                               ## creates variable to hold column of data


print "\nWorking on file: $ARGV[0]\n\n";        ## while data exists in input file, read 
                                                ## line by line
while (my $file = <IN>) {               
    chomp $file;                        ## remove trailing newline
    print "$file\n";
    @cols = split /\t/, $file;          ## split data into columns on tab
    print "@cols[9]\n";
    my %hits;
    $hits{species} += 1;
    print "$hits{species}\n";
    print OUT "@cols[9]\n";             ##write species column to output file
}
print "File has been read and output written!\n";


close IN;
close OUT;

This is currently what I have for my code and any suggestions or tips would be greatly appreciated. Thanks!

Sample of input data (Bacteria_firmicutes is the 11th column)

Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes

CodePudding user response：

I had to change the split function's tab delimiter to split on spaces instead, because the sample data as posted is space separated. I shortened the code a bit to make it a little easier to understand. On Perl 5.16.3 this works just fine.

use strict; 
use warnings;

my @cols;                               ## creates variable to hold column of data

print "\nWorking on __DATA__\n\n";

while (my $file = <DATA>) {               
    chomp $file;                        ## remove trailing newline
    print "$file\n";
    @cols = split ' ', $file;           ## split on runs of whitespace (spaces or tabs); use /\t/ for strictly tab-delimited data
    print "$cols[10]\n";
    my %hits;
    $hits{species} += 1;
}

__DATA__
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes
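
Why the delimiter matters can be seen in a small sketch. It assumes the sample line from the question is space separated (as pasted) and builds a hypothetical tab-separated copy of it for comparison; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Sample line as pasted in the question (space separated).
my $spaced = 'Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes';

# Hypothetical tab-separated version of the same line.
(my $tabbed = $spaced) =~ tr/ /\t/;

# Splitting on tab finds no delimiter in the spaced line: one big field.
my @by_tab = split /\t/, $spaced;
print scalar(@by_tab), "\n";          # 1

# split ' ' splits on any run of whitespace, so it handles both forms.
my @from_spaces = split ' ', $spaced;
my @from_tabs   = split ' ', $tabbed;
print scalar(@from_spaces), "\n";     # 11
print "$from_tabs[9]\n";              # Bacillus_subtilis
print "$from_tabs[10]\n";             # Bacteria_firmicutes
```

With /\t/ on the space-separated line, $cols[9] is undefined, which is probably why nothing useful was printed for the pasted sample.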

EDIT: I would also like to submit this for you to study. You can capture the last string before the newline character. This is useful if you always want the very last column in the sample data.

Also, using this method, you can modify the regex and search for specific patterns in sample data.

while (my $file = <DATA>) {
    ## capture the last whitespace-separated field on the line
    ## (or use /(SOME_PATTERN)/ to search for a specific pattern)
    if ($file =~ /\s(\S+)\s*$/) {
        print "$1\n";
    }
}

__DATA__
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes
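
As an alternative to the regex, the last column can also be taken with a negative array index after splitting. This is only a minimal sketch, assuming whitespace-separated fields; last_field is a hypothetical helper name, not something from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last whitespace-separated field of a line,
# or undef if the line is empty or blank.
sub last_field {
    my ($line) = @_;
    my @fields = split ' ', $line;    # trailing newline is discarded by split ' '
    return @fields ? $fields[-1] : undef;
}

my $sample = "Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes\n";
print last_field($sample), "\n";      # Bacteria_firmicutes
```

The regex is the better fit when searching for a specific pattern; the index form is easier to adapt when a different column is wanted (for example $fields[9]).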

EDIT 2: Here is a good way to keep track of which and how many patterns were matched in the sample data (EDIT: code shortened some more):

use strict; 
use warnings;
 
my %hits; # always declare hashes/arrays outside of loops unless it's intentional

# ADD and/or INCREMENT hash foreach pattern found 

while (my $line = <DATA>) {
    $hits{$1}++ if $line =~ /\s(\S+)\s*$/;
}

print "$_: $hits{$_}\n" for sort keys %hits;

__DATA__
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test2
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test2
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test2
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test2
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test2
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test3
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test3

CodePudding user response：

The following code should work, assuming your actual input is tab separated values instead of space separated like the sample input in the question seems to be.

Here's what I changed:

- Added a mode ("<") to the first open() call.
- Assigned $cols[9] to a variable, $species, and used it as the key for %hits.
- Removed the debugging print calls for the column and the hash from the while loop.
- Moved the declaration of %hits above the loop, so its contents persist across lines.
- Added a foreach loop over %hits that prints each species name, preceded by the number of times it was seen, to the OUT filehandle.

#!/usr/bin/perl
use strict; use warnings;

open IN, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";    ## open input file given as argument 1
open OUT, ">", $ARGV[1] or die "Cannot open $ARGV[1]: $!";   ## open output file given as argument 2
my @cols;                               ## creates variable to hold column of data


print "\nWorking on file: $ARGV[0]\n\n";        ## while data exists in input file, read 
                                                ## line by line
my %hits;
while (my $file = <IN>) {               
    chomp $file;                        ## remove trailing newline
    print "$file\n";
    @cols = split /\t/, $file;          ## split data into columns on tab
    my $species = $cols[9];
    $hits{$species} += 1;               ## count each occurrence of the species
}
foreach my $species (keys %hits) {
    print OUT "$hits{$species} $species\n";
}
print "File has been read and output written!\n";


close IN;
close OUT;
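
For comparison, the same counting idea can be sketched with lexical filehandles and a small helper. This is not the code from the answer above: count_column is a hypothetical name, the in-memory sample data (including the Escherichia_coli rows) is made up for the demonstration, and the input is assumed to be tab separated with the species in the tenth column.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the values found in one column of a
# tab-separated filehandle. $index is zero-based (9 is the tenth column).
sub count_column {
    my ($fh, $index) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        my @cols = split /\t/, $line;
        next unless defined $cols[$index];   # skip short or blank lines
        $count{ $cols[$index] }++;
    }
    return \%count;
}

# Demonstration on an in-memory file with made-up rows;
# real code would open $ARGV[0] instead.
my $data = join '', map { join("\t", ('x') x 9, $_) . "\n" }
    qw(Bacillus_subtilis Bacillus_subtilis Escherichia_coli);
open my $in, '<', \$data or die "Cannot open in-memory data: $!";
my $hits = count_column($in, 9);
close $in;

# Most frequent species first, ties broken alphabetically.
for my $species (sort { $hits->{$b} <=> $hits->{$a} or $a cmp $b } keys %$hits) {
    print "$hits->{$species}\t$species\n";
}
```

Sorting the keys gives a stable report, since the order returned by keys on a hash is not predictable from run to run.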