filter blacklist domains from file

Time:09-21

I have two files.

The first file, blacklist.txt, contains:

test.example.com
*.test.example.com

The second file, subdomains.txt, contains:

test.example.com
123.test.example.com
abc-test.example.com
www.example.com

The expected results file contains:

abc-test.example.com
www.example.com

That is, every domain listed in the blacklist file should be filtered out of subdomains.txt. Wildcards should be handled at the same time: if a blacklist entry starts with *. then all matching subdomains should be removed too, as shown in the expected results.

During my search I found the following awk command, but it does not work when blacklist.txt contains a * entry:

awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' blacklist.txt subdomains.txt

When the blacklist contains only plain domains, it filters them out correctly.

I have also tried the comm command, but it does not appear to handle patterns.

CodePudding user response:

If you rewrite the blacklist entries as regular expressions (escape the dots and use .* for the wildcard):

test\.example\.com
.*\.test\.example\.com

you could use:

$ awk 'NR==FNR {
    a[$0];next
}
{
    for(i in a)
        if(match($0,"^" i "$"))
            next
    print
}' blacklist subdomains

Output:

abc-test.example.com
www.example.com

CodePudding user response:

If you process the blacklist with sed to turn it into a regex list, you can do it with:

sed 's/\./\\./g;s/\*/.*/g' blacklist.txt |
grep -vixf - subdomains.txt
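
To see what each stage of that pipeline does, here is a small sketch that runs the two steps separately on the sample data from the question. The temporary directory and the variable names patterns and result are only for illustration.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'test.example.com' '*.test.example.com' > "$workdir/blacklist.txt"
printf '%s\n' 'test.example.com' '123.test.example.com' \
    'abc-test.example.com' 'www.example.com' > "$workdir/subdomains.txt"

# Step 1: escape every literal dot, then turn each * wildcard into .*
patterns=$(sed 's/\./\\./g;s/\*/.*/g' "$workdir/blacklist.txt")
printf '%s\n' "$patterns"

# Step 2: -f - reads the patterns from stdin, -x requires a whole-line
# match, -i ignores case, and -v keeps only lines matching no pattern.
result=$(printf '%s\n' "$patterns" | grep -vixf - "$workdir/subdomains.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

The first printf shows the generated patterns, test\.example\.com and .*\.test\.example\.com, which are the same regexes the previous answer wrote by hand. The second shows the filtered list, abc-test.example.com and www.example.com.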

CodePudding user response:

A pure bash version that doesn't require any pre-processing of the blacklist patterns:

#!/usr/bin/env bash

readarray -t blacklist < blacklist.txt
while read -r domain; do
    match=0
    for pat in "${blacklist[@]}"; do
        if [[ $domain == $pat ]]; then
            match=1
            break
        fi
    done
    [[ $match -eq 0 ]] && printf "%s\n" "$domain"
done < subdomains.txt
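
The reason no pre-processing is needed is that bash treats an unquoted right-hand side of == inside [[ ]] as a glob pattern, so the * in a blacklist entry works as a wildcard directly. A minimal sketch of the difference, with variable names chosen only for illustration:

```shell
#!/usr/bin/env bash
domain='123.test.example.com'
pat='*.test.example.com'

# Unquoted right-hand side: $pat is interpreted as a glob pattern.
if [[ $domain == $pat ]]; then glob_match=yes; else glob_match=no; fi

# Quoted right-hand side: "$pat" is compared as a literal string.
if [[ $domain == "$pat" ]]; then literal_match=yes; else literal_match=no; fi

# The dot before "test" must be present, so this name is not caught.
if [[ abc-test.example.com == $pat ]]; then other_match=yes; else other_match=no; fi

printf 'glob: %s, literal: %s, abc-test: %s\n' \
    "$glob_match" "$literal_match" "$other_match"
```

This prints glob: yes, literal: no, abc-test: no. Quoting $pat in the filter script would therefore reduce it to the exact-match behaviour the question started with.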

And to throw it in, a Tcl version that should be much more efficient than the bash script on large files:

#!/usr/bin/env tclsh

# Takes two arguments; the blacklist file and the domain file
# e.g.,
# ./domainfilter blacklist.txt subdomains.txt > results.txt

proc ggrep {blacklist domainfile} {
    set f [open $domainfile]
    set domains [split [read -nonewline $f] \n]
    close $f
    set f [open $blacklist]
    while {[gets $f pattern] >= 0} {
        set domains [lsearch -inline -all -not -glob $domains $pattern]
    }
    close $f
    puts [join $domains \n]
}
ggrep [lindex $argv 0] [lindex $argv 1]

Also a more efficient zsh version, if that shell is an option:

#!/usr/bin/env zsh

declare -A blacklist
while read -r pattern; do
    blacklist[$pattern]=1
done < blacklist.txt

while read -r domain; do
    # Treat the array keys as glob patterns matched against index
    [[ -z ${blacklist[(k)$domain]} ]] && printf "%s\n" "$domain"
done < subdomains.txt