How to match and find common based on substring from two files?

Time:06-18

I have two files. File1 contains a list of email addresses. File2 contains a list of domains.

I want to extract all the email addresses whose domain exactly matches one of the listed domains, using a Perl script.

I am using the code below, but I don't get the correct result.

#!/usr/bin/perl 
#use strict;
#use warnings;
use feature 'say';

my $file1 = "/home/user/domain_file" or die " FIle not found\n";
my $file2 = "/home/user/email_address_file" or die " FIle not found\n";

my $match = open(MATCH, ">matching_domain") || die;

open(my $data1, '<', $file1) or die "Could not open '$file1' $!\n";
my @wrd = <$data1>;
chomp @wrd;
# loop on the file to be searched
open(my $data2, '<', $file2) or die "Could not open '$file2' $!\n";
while(my $line = <$data2>) {
    chomp $line;
    foreach (@wrd) {
        if($line =~ /\@$_$/) {
            print MATCH "$line\n";
        }
    }
}

File1

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

File2

yahoo.com
gmail.com

Expected output

[email protected]
[email protected]

CodePudding user response:

First off, since you seem to be on *nix, you might want to check out grep -f, which reads its search patterns from a given file. I'm no expert in grep, but I would try -f with the domain file together with -w ("match whole words"), and this should be fairly easy.

Second: Your Perl code can be improved, but it works as expected, provided you put the emails and domains in the files as indicated by your code. It may be that you have mixed the files up.

If I run your code, fixing only the paths, and keeping the domains in file1, it does create the file matching_domain and it contains your expected output:

[email protected]
[email protected]

So I don't know what you think your problem is (because you did not say). Maybe you were expecting it to print output to the terminal. Either way, it does work, but there are things to fix.

#use strict;
#use warnings;

It is a huge mistake to remove these two. It is the biggest mistake you will ever make while coding Perl. It will not remove your errors, just hide them. You will spend 10 times as much time fixing bugs. Uncommenting these two lines is the first thing you should do to fix this.

use feature 'say';

You never use this. You could for example replace print MATCH "$line\n" with say MATCH $line, which is slightly more concise.

my $file1 = "/home/user/domain_file" or die " FIle not found\n";
my $file2 = "/home/user/email_address_file" or die " FIle not found\n";

This is very incorrect. You are placing a condition on the creation of a variable. If the condition fails, does the variable exist? Don't do this. I assume this is meant to check whether the file exists, but that is not what it does. To check if a file exists, you can use -e, documented under perldoc -f -X (the various file tests).

Furthermore, a non-empty string such as "/home/user..." is TRUE ("truthy"), as far as Perl conditions are concerned. A value is only false if it is "0" (zero), "" (empty) or undef (undefined). So your or clause will never be executed. E.g. "foo" or die will never die.

Lastly, this test is quite meaningless, as you will be testing this in your open statement later on anyway. If the file does not exist, the open will fail and your program will die.
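
To illustrate the difference between a string being true and a file existing, here is a minimal sketch. The truthiness helper and the temporary file are made up for the demonstration and are not part of your script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: report how Perl sees a value in boolean context.
sub truthiness {
    my ($value) = @_;
    return $value ? "true" : "false";
}

# A path string is true whether or not the file exists.
print "path string:  ", truthiness("/home/user/domain_file"), "\n";
print "string '0':   ", truthiness("0"), "\n";
print "empty string: ", truthiness(""), "\n";

# -e asks the filesystem instead. The temp file exists only for this demo.
my ($fh, $existing) = tempfile(UNLINK => 1);
my $missing = "$existing.does-not-exist";   # assumed not to exist
print "-e on existing file: ", (-e $existing ? "yes" : "no"), "\n";
print "-e on missing file:  ", (-e $missing  ? "yes" : "no"), "\n";
```

Only the -e test says anything about the filesystem; the string itself is just data.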

my $match = open(MATCH, ">matching_domain") || die;

This is also very incorrect. First off, you never use the $match variable. Secondly, I bet it does not contain what you think it does (it contains a boolean which states whether open was successful or not; see perldoc -f open). Thirdly, again, don't put conditions on my declarations of variables; it is a bad idea.

What this statement really means is that $match will contain either the return value of the open, or the return value of die. This should probably be simply:

open my $match, ">", "matching_domain" or die "Cannot open 'matching_domain': $!";

Also, use the three-argument open with an explicit MODE, and use lexical file handles, like you have done elsewhere.
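
As a rough sketch of that advice, the following writes one line through a lexical handle and reads it back. The scratch directory and the address user@example.com are placeholders chosen for the example, not anything from your files.

```perl
use strict;
use warnings;
use feature 'say';
use File::Temp qw(tempdir);

# Write into a scratch directory so the example leaves nothing behind.
my $dir      = tempdir(CLEANUP => 1);
my $out_file = "$dir/matching_domain";

# Three-argument open, explicit mode, lexical handle, checked with "or".
open my $out, '>', $out_file or die "Cannot open '$out_file': $!";
say {$out} 'user@example.com';    # placeholder address
close $out or die "Cannot close '$out_file': $!";

# Read the file back to confirm what was written.
open my $in, '<', $out_file or die "Cannot open '$out_file': $!";
chomp(my @lines = <$in>);
close $in;
say "wrote: @lines";
```

Keeping the file name in a variable means the error message can name the file rather than the handle.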

And one more thing on top of all the stuff I've already badgered you with: I don't recommend hard-coding output files for small programs like this. Print to standard output, and if you want the output in a file, use shell redirection: perl foo.pl > output.txt. I think this is what has prompted you to think something is wrong with your code: You don't see the output.

Other than that, your code is fine, as near as I can tell. You already chomp the lines from the domain file, and that is needed: a trailing newline left on a domain would keep the pattern from matching the chomped email line. Also remember that indentation is a good thing, and it helps you read your code. I mentioned this in a comment, but it was removed for some reason. It is important though.
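
Putting these suggestions together, a version that prints to standard output might look roughly like this. It is a sketch under assumptions: the matching_emails sub and the sample addresses are invented for the example (a real script would read both lists from files), and the \Q...\E escaping is an addition of mine so that the dot in a domain only matches a literal dot.

```perl
use strict;
use warnings;
use feature 'say';

# Return the addresses whose domain part is exactly one of the given domains.
sub matching_emails {
    my ($domains, $emails) = @_;
    my @found;
    for my $email (@$emails) {
        for my $domain (@$domains) {
            # \Q...\E escapes regex metacharacters in the domain.
            if ($email =~ /\@\Q$domain\E$/) {
                push @found, $email;
                last;    # one hit per address is enough
            }
        }
    }
    return @found;
}

# Hypothetical sample data standing in for the two input files.
my @domains = ('yahoo.com', 'gmail.com');
my @emails  = ('alice@yahoo.com', 'bob@hotmail.com', 'carol@gmail.com',
               'dave@yahooxcom', 'erin@mail.yahoo.com');

say for matching_emails(\@domains, \@emails);
```

Run as perl foo.pl > output.txt to capture the result in a file.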

Good luck!

CodePudding user response:

This assumes that the lines labeled File1 are in the file pointed to by $file1 and the lines labeled File2 are in the file pointed to by $file2.

You have your variables swapped. You want to match what is in $line against $_, not the other way around:

# loop on the file to be searched
open( my $data2, '<', $file2 ) or die "Could not open '$file2' $!\n";
while ( my $line = <$data2> ) {
    chomp $line;
    foreach (@wrd) {
        if (/\@$line$/) {
            print MATCH "$_\n";
        }
    }
}

You should un-comment the warnings and strict lines:

use strict;
use warnings;

With warnings enabled, Perl shows you that the or die checks in the file name assignment statements are not working the way you intended. Just use:

my $file1 = "/home/user/domain_file";
my $file2 = "/home/user/email_address_file";

You are already doing the checks where they belong (on open).
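
If the domain list grows, an alternative to the nested loop is a hash lookup on the part after the "@". This is only a sketch of that swapped-in technique, not what your code does: the filter_by_domain sub and the sample addresses are hypothetical, and lower-casing the domain is my assumption about how you want case handled.

```perl
use strict;
use warnings;
use feature 'say';

# Keep only the addresses whose domain is a key of the lookup hash.
sub filter_by_domain {
    my ($wanted, @emails) = @_;
    return grep {
        my ($domain) = /\@([^\@]+)$/;    # text after the last "@"
        defined $domain && $wanted->{ lc $domain };
    } @emails;
}

# Hypothetical sample data standing in for the two input files.
my %wanted;
$wanted{ lc $_ } = 1 for ('yahoo.com', 'gmail.com');
my @emails = ('alice@yahoo.com', 'bob@hotmail.com', 'carol@GMAIL.com');

say for filter_by_domain(\%wanted, @emails);
```

Each address is checked once against the hash, so the domain file is not rescanned for every email line.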

Tags: perl