I have two files.
The first file, blacklist.txt, contains:
test.example.com
*.test.example.com
The second file, subdomains.txt, contains:
test.example.com
123.test.example.com
abc-test.example.com
www.example.com
The expected results file contains:
abc-test.example.com
www.example.com
That is, every subdomain listed in the blacklist file should be filtered out of subdomains.txt.
Patterns should be handled at the same time: if a blacklist entry starts with *.
then all subdomains of that domain should be removed too, as the expected results show.
While searching I found the following awk command, but it does not work when blacklist.txt
contains a * entry:
awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' blacklist.txt subdomains.txt
It filters correctly only when the blacklist contains plain domain names.
I have also tried the comm command, but it looks like it does not handle regexes.
CodePudding user response:
If you modify the blacklist regexps a bit:
test\.example\.com
.*\.test\.example\.com
you could:
$ awk 'NR==FNR {
    a[$0];next
}
{
    for(i in a)
        if(match($0,"^" i "$"))
            next
    print
}' blacklist subdomains
Output:
abc-test.example.com
www.example.com
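If you would rather leave blacklist.txt exactly as posted in the question, the same loop can probably do the conversion itself. The following is only a sketch under some assumptions: filter_blacklist is a made-up helper name, * is the only wildcard in the blacklist, and matching should be case-insensitive like the tolower() attempt in the question.

```shell
# filter_blacklist BLACKLIST DOMAINS   (hypothetical helper name)
# Prints the lines of DOMAINS that match no entry of BLACKLIST.
filter_blacklist() {
  awk '
    NR == FNR {                    # first file: the blacklist
      if ($0 == "") next           # ignore empty lines
      pat = tolower($0)
      gsub(/\./, "[.]", pat)       # make every dot a literal dot
      gsub(/\*/, ".*", pat)        # turn the * wildcard into .*
      re["^" pat "$"]              # keep the anchored regex as an array key
      next
    }
    {                              # second file: the domains
      line = tolower($0)
      for (r in re)
        if (line ~ r) next         # blacklisted, skip this line
      print
    }
  ' "$1" "$2"
}
```

Called as filter_blacklist blacklist.txt subdomains.txt with the files from the question, this should print abc-test.example.com and www.example.com. Dots are rewritten as [.] rather than \. only to avoid backslash handling differences between awk implementations.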
CodePudding user response:
If you process the blacklist with sed to turn it into a regex list, you can do it with:
sed 's/\./\\./g;s/\*/.*/g' blacklist.txt |
grep -vixf - subdomains.txt
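To see why this works, it can help to split the pipeline in two and look at the intermediate pattern list. This is just a sketch; blacklist_to_regex and filter_with_grep are hypothetical names introduced here for illustration.

```shell
# blacklist_to_regex FILE   (hypothetical helper name)
# Escapes literal dots and turns each * into .* so that grep can use
# the lines as regular expressions.
blacklist_to_regex() {
  sed 's/\./\\./g; s/\*/.*/g' "$1"
}

# filter_with_grep BLACKLIST DOMAINS   (hypothetical helper name)
# -f - reads the patterns from stdin, -x requires a whole-line match,
# -i ignores case, -v prints only the lines that do NOT match.
filter_with_grep() {
  blacklist_to_regex "$1" | grep -vixf - "$2"
}
```

For the blacklist in the question, blacklist_to_regex blacklist.txt prints test\.example\.com and .*\.test\.example\.com, which are the same patterns the first answer writes by hand. The -x flag is what keeps abc-test.example.com in the output, because the first pattern then has to match the entire line.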
CodePudding user response:
A pure bash version that doesn't require any pre-processing of the blacklist patterns:
#!/usr/bin/env bash
readarray -t blacklist < blacklist.txt
while read -r domain; do
    match=0
    for pat in "${blacklist[@]}"; do
        if [[ $domain == $pat ]]; then
            match=1
            break
        fi
    done
    [[ $match -eq 0 ]] && printf "%s\n" "$domain"
done < subdomains.txt
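The bash script relies on the right-hand side of == being unquoted, so the * in a blacklist entry acts as a glob wildcard rather than a literal character. If bash is not available, roughly the same idea might be written for plain sh with case, which also treats an unquoted expansion as a pattern. This is a sketch only; filter_posix is a hypothetical name, and it re-reads the blacklist for every domain, so it is likely to be slow on large files.

```shell
# filter_posix BLACKLIST DOMAINS   (hypothetical helper name)
# Prints the lines of DOMAINS that match no glob pattern in BLACKLIST.
filter_posix() {
  while IFS= read -r domain; do
    match=0
    while IFS= read -r pat; do
      [ -n "$pat" ] || continue        # ignore empty blacklist lines
      case $domain in
        $pat) match=1; break ;;        # unquoted: $pat is used as a glob
      esac
    done < "$1"
    if [ "$match" -eq 0 ]; then
      printf '%s\n' "$domain"
    fi
  done < "$2"
}
```

Unlike the grep -i answer, this comparison is case-sensitive, so mixed-case input would need to be lowercased first.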
And to throw it in, a tcl version that should be much more efficient than the above script on large files:
#!/usr/bin/env tclsh
# Takes two arguments; the blacklist file and the domain file
# e.g.,
# ./domainfilter blacklist.txt subdomains.txt > results.txt
proc ggrep {blacklist domainfile} {
    set f [open $domainfile]
    set domains [split [read -nonewline $f] \n]
    close $f
    set f [open $blacklist]
    while {[gets $f pattern] >= 0} {
        set domains [lsearch -inline -all -not -glob $domains $pattern]
    }
    close $f
    puts [join $domains \n]
}
ggrep [lindex $argv 0] [lindex $argv 1]
Also a more efficient zsh version, if that shell is an option:
#!/usr/bin/env zsh
declare -A blacklist
while read -r pattern; do
    blacklist[$pattern]=1
done < blacklist.txt
while read -r domain; do
    # Treat the array keys as glob patterns matched against index
    [[ -z ${blacklist[(k)$domain]} ]] && printf "%s\n" "$domain"
done < subdomains.txt