Perl compare two files and print occurrences

All,

I have 2 files, SC and ID. In ID, I have just 2 columns separated by whitespace. SC has many more columns, but its first two columns might match a pair present in ID.

For example:

ID  

chain_0 123
chain_1 456
chain_2 789

SC  

chain_0 123 toronto ontario canada
chain_1 456 toronto New Delhi India 
chain_2 789 housing_crisis mortgage_rates first_time_buyers miserable

Now I want to print the lines in SC that match a pair in ID. I tried the following, but it doesn't work.

open(ID, '<', $id) or die $!;

while(<ID>){
   my @array = split ' ', $_;
   $output = `awk '\$1 ~ /\<$array[0]\>/' scan_cells | awk '\$2 ~ /\<$array[1]\>/'` ;
   print "$output";
}

close(ID);

Thanks!

CodePudding user response：

One way uses just grep, with bash process substitution and sed massaging the lines of ID into regular expressions that match only at the start of a line:

grep -f <(sed 's/^/^/; s/[[:space:]]/[[:space:]]/; s/$/[[:space:]]/' ID) SC
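To see what grep actually receives, here is a minimal sketch that runs the same sed expression over two of the sample ID lines and prints the generated patterns. The variable name patterns is only for illustration.

```shell
# Feed two sample ID lines through the sed expression from the answer
# and capture the regular expressions it produces.
patterns=$(printf 'chain_0 123\nchain_1 456\n' |
  sed 's/^/^/; s/[[:space:]]/[[:space:]]/; s/$/[[:space:]]/')

# Prints:
# ^chain_0[[:space:]]123[[:space:]]
# ^chain_1[[:space:]]456[[:space:]]
printf '%s\n' "$patterns"
```

Note that each pattern ends in [[:space:]], so an SC line consisting of only the two ID columns with nothing after them would not match. The ID text is also used as a regular expression, so characters such as . in an ID would act as metacharacters.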

And in perl:

#!/usr/bin/env perl
use strict;
use warnings;

# Takes the files as command line arguments
my ($id_file, $sc_file) = @ARGV;

my %ids;

open my $ID, "<", $id_file or die "Unable to open $id_file: $!\n";
while (<$ID>) {
    # Just in case there's a tab instead of a single space between columns
    $_ = join(" ", split);
    $ids{$_} = 1;
}
close $ID;

open my $SC, "<", $sc_file or die "Unable to open $sc_file: $!\n";
while (<$SC>) {
    my @cols = split;
    print if exists $ids{"@cols[0,1]"};
}
close $SC;

The idea here is to store each line of ID as a key in a hash table, and then for each line of SC, see if the first two columns exist as a key in that table, and if so, print it.

The same approach can be done more succinctly in awk, though:

awk 'FNR == NR { ids[$1,$2] = 1; next }
     ($1,$2) in ids' ID SC
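The following is a small self-contained sketch of that awk command, run against made-up sample files in a temporary directory (the file contents and the result variable are assumptions for the demo). It shows that a line is printed only when both of its first two columns match the same ID line.

```shell
# Work in a throwaway directory so the sample files do not clobber anything.
tmp=$(mktemp -d)
printf 'chain_0 123\nchain_1 456\n' > "$tmp/ID"
printf 'chain_0 123 toronto\nchain_0 999 ottawa\nchain_1 456 delhi\n' > "$tmp/SC"

# FNR == NR is true only while the first file (ID) is being read,
# so the first rule fills the array and the second filters SC.
result=$(awk 'FNR == NR { ids[$1,$2] = 1; next }
              ($1,$2) in ids' "$tmp/ID" "$tmp/SC")

# Prints the toronto and delhi lines; "chain_0 999 ottawa" is skipped.
printf '%s\n' "$result"
rm -rf "$tmp"
```

One caveat with the FNR == NR idiom: if ID is empty, the condition stays true while SC is read, so SC ends up being treated as the ID file and nothing is printed.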

CodePudding user response：

Using awk from a Perl program is almost always a mistake. Whatever you're using awk for, you can probably do more easily in Perl.

Here's how I'd approach your problem. Create a hash where the keys are the IDs and the values are some true value (1 is easiest). Then iterate across the SC file and print only if the start of the line matches a key in the hash.

Something like this:

#!/usr/bin/perl

# Always :-)
use strict;
use warnings;

# Open the id file
open my $id, '<', 'id' or die $!;

# Read the ids in to an array
chomp( my @ids = <$id> );

# Convert the array into a hash
my %id = map { $_ => 1 } @ids;

# Read a line at a time from the file
# given on the command line.
while (<>) {
  # split the line into fields (on whitespace)
  my @data = split;
  # Print only if the first two fields match
  # a record in %id
  print if $id{"$data[0] $data[1]"};
}

This hardcodes the name of the ID file, but you pass the name of the SC file on the command line. If you call this program idfilter, then you'd run it like this:

$ ./idfilter sc

CodePudding user response：

Assuming that the columns of interest in file SC are also the first 2 columns, and that the field separators (whitespace) are the same in both files, you can store each whole line from file ID as a key in an array, a[$0].

While processing the second file, check whether column 1 concatenated with column 2 (joined by the output field separator) occurs as a key in the array that holds all entries from file ID.

awk 'FNR == NR{a[$0]; next} $1 OFS $2 in a' ID SC

Test content of the files:

$ cat ID
chain_0 123
chain_1 456
chain_2 789
chain_9 999

$ cat SC
chain_0 123 toronto ontario canada
chain_1 456 toronto New Delhi India
chain_2 789 housing_crisis mortgage_rates first_time_buyers miserable
chain_3 999 housing_crisis mortgage_rates first_time_buyers miserable

Output

chain_0 123 toronto ontario canada
chain_1 456 toronto New Delhi India
chain_2 789 housing_crisis mortgage_rates first_time_buyers miserable

If the field separators in the two files are different, you might instead key the array on the two fields separately, using a multidimensional array:

awk 'FNR==NR{a[$1, $2];next} 
{
  for (pair in a) {
    split(pair, sep, SUBSEP);
    if ($1 == sep[1] && $2 == sep[2]) print
  }
}
' ID SC
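The loop above compares every stored pair against each SC line. As a sketch of a shorter variant (not the answerer's code), the same multidimensional keys can be tested directly with in. The sample files below are made up, with a tab in ID and single spaces in SC, to show that the separator no longer matters.

```shell
tmp=$(mktemp -d)
# ID uses a tab between the columns, SC uses single spaces.
printf 'chain_0\t123\nchain_9\t999\n' > "$tmp/ID"
printf 'chain_0 123 toronto ontario canada\nchain_3 999 housing_crisis\n' > "$tmp/SC"

# Keys are built from the split fields, not from the raw line,
# so differing whitespace between the two files is harmless.
matched=$(awk 'FNR == NR { a[$1, $2]; next } ($1, $2) in a' "$tmp/ID" "$tmp/SC")

# Prints only: chain_0 123 toronto ontario canada
printf '%s\n' "$matched"
rm -rf "$tmp"
```

The membership test is a single array lookup per SC line, whereas the loop does one comparison per stored pair.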