I have an issue with a regular expression in awk. I have an array of links, and I need to get all links that start with www.yahoo.com except those where the host is followed by codeword. For example, www.yahoo.com/nocodeword/... is OK, but www.yahoo.com/codeword/... is not.
What is a workaround for regular expression negative lookahead in POSIX Extended Regular Expressions (ERE)? As I understand it, there is no such native feature in awk, sed, or grep. I came up with a solution using nested if statements. Is there a better approach? I can't use GNU tools or grep -P.
Here is my solution:
links=("https://www.yahoo.com/" "https://www.yahoo.com/codeword/lalala1" "https://www.yahoo.com/nocodeword" "https://www.bing.com/" "https://www.yahoo.com/codeword" "https://www.yahoo.com/codeword/lalala2" "https://www.google.com/" "https://www.yahoo.com/foo/codeword" "https://www.yahoo.com/codewordbar")
exception_link="www.yahoo.com"
exception_word="codeword"
for link in "${links[@]}";
do
processed_link=`awk -F/ '{print $3}' <<<"$link"`
if [[ "$processed_link" == "$exception_link" ]];
then
processed_link2=`awk -F/ '{print $4}' <<<"$link"`
if [[ $processed_link2 != $exception_word ]]; then
echo "${link}"
fi
fi
done
Expected output:
https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.yahoo.com/codewordbar
I would be grateful for any help.
CodePudding user response:
Lookahead, lookbehind, etc. are all just syntactic sugar in the context of Unix tools. You don't need them, and you don't need to find an alternative to them if they're not something you think in terms of in the first place.
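To illustrate the point in plain ERE terms: a negative lookahead can usually be expressed as one regexp that must match combined with a second regexp that must not. The following is only a minimal sketch under a few assumptions: `filter_links` is a hypothetical helper name, links arrive on stdin one per line, and only the first path segment is compared against the codeword (so foo/codeword is kept).

```shell
#!/usr/bin/env bash
# filter_links is a hypothetical helper: reads links on stdin, one per line,
# and prints the ones that should be kept.
filter_links() {
    awk '
        # First regexp: the host is exactly www.yahoo.com, under any scheme.
        # Second regexp (negated): the first path segment is exactly "codeword".
        /^[^:]+:\/\/www\.yahoo\.com(\/|$)/ &&
        !/^[^:]+:\/\/www\.yahoo\.com\/codeword(\/|$)/
    '
}
```

It could be fed from a shell array with `printf '%s\n' "${links[@]}" | filter_links`. The `(\/|$)` group is what stands in for the lookahead: it requires the host or the codeword to be followed by a slash or the end of the line, so codewordbar is not mistaken for codeword.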
Assuming you have some reason to first create an array of links in shell, I'd just do the rest in 1 call to awk:
$ cat tst.sh
#!/usr/bin/env bash
links=("https://www.yahoo.com/" "https://www.yahoo.com/codeword/lalala1" "https://www.yahoo.com/nocodeword" "https://www.bing.com/" "https://www.yahoo.com/codeword" "https://www.yahoo.com/codeword/lalala2" "https://www.google.com/" "https://www.yahoo.com/foo/codeword" "https://www.yahoo.com/codewordbar")
awk -v links_="$(printf '%s\n' "${links[@]}")" '
BEGIN {
exception_link="www.yahoo.com"
exception_word="codeword"
split(links_,links,RS)
for ( i in links ) {
schemeless_link = links[i]
sub("^[^:]+://","",schemeless_link)
if ( (index(schemeless_link"/",exception_link"/") == 1) &&
!(index(schemeless_link"/",exception_link"/"exception_word"/") == 1) ) {
print links[i]
}
}
}
'
$ ./tst.sh
https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.yahoo.com/foo/codeword
https://www.yahoo.com/codewordbar
or, if foo/codeword should not be present in the output, then:
$ cat tst.sh
#!/usr/bin/env bash
links=("https://www.yahoo.com/" "https://www.yahoo.com/codeword/lalala1" "https://www.yahoo.com/nocodeword" "https://www.bing.com/" "https://www.yahoo.com/codeword" "https://www.yahoo.com/codeword/lalala2" "https://www.google.com/" "https://www.yahoo.com/foo/codeword" "https://www.yahoo.com/codewordbar")
awk -v links_="$(printf '%s\n' "${links[@]}")" '
BEGIN {
exception_link="www.yahoo.com"
exception_word="codeword"
split(links_,links,RS)
for ( i in links ) {
schemeless_link = links[i]
sub("^[^:]+://","",schemeless_link)
if ( (index(schemeless_link"/",exception_link"/") == 1) &&
!index(schemeless_link"/","/"exception_word"/") ) {
print links[i]
}
}
}
'
$ ./tst.sh
https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.yahoo.com/codewordbar
The above assumes that none of your links can contain literal newline characters, and that you want to test for www.yahoo.com regardless of whether the link starts with https, http, or any other scheme.
CodePudding user response:
It seems that using pattern matching, instead of regular expressions, with the [[...]] construct of bash is sufficient and much simpler for this problem:
for lnk in "${links[@]}"; do
if [[ $lnk == https://www.yahoo.com* && $lnk/ != */codeword/* ]]; then
echo "$lnk"
fi
done
which prints:
https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.yahoo.com/codewordbar
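As a variation on the same idea, here is a rough sketch that also accepts other schemes and checks the host boundary. `keep_link` is a hypothetical helper name, and unlike the loop above it assumes only the first path segment should be compared against the codeword.

```shell
#!/usr/bin/env bash
# keep_link is a hypothetical helper: returns 0 if the link should be kept.
keep_link() {
    local lnk=$1 host="www.yahoo.com" word="codeword"
    local rest=${lnk#*://}    # drop the scheme, whatever it is
    # Appending "/" lets a bare host or a bare codeword be tested
    # with the same patterns as longer paths.
    [[ $rest/ == "$host"/* && $rest/ != "$host/$word"/* ]]
}
```

Inside the loop it would be used as `keep_link "$lnk" && echo "$lnk"`. Because the quoted parts of the patterns are matched literally, a host such as www.yahoo.community is not treated as www.yahoo.com.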
CodePudding user response:
If you are just skipping the records with codeword as the first part of the endpoint, you might do better to just filter them out of your input.
Otherwise, I think you'd be better served either using bash built-in string processing, or converting the whole logic to awk (or perl, or python, etc...)
This code should demonstrate those variations...
$: cat tst
#!/bin/bash
exception_host="www.yahoo.com"
exception_word="codeword"
printf "\nUnfiltered list: \n\n"
cat links.txt
printf "\nPre-filtering only at start of path - \n\n"
mapfile -t culled < <( grep -Ev "https://$exception_host/codeword(/|\$)" links.txt )
printf "%s\n" "${culled[@]}"
printf "\nPre-filtering anywhere in path - \n\n"
mapfile -t culled < <( grep -Ev "https://$exception_host(/|/.+/)codeword(/|\$)" links.txt )
printf "%s\n" "${culled[@]}"
printf "\nIn-line filtering at start of path only - \n\n"
mapfile -t links < links.txt
shopt -s extglob
for lnk in "${links[@]}"; do
case "$lnk" in https://$exception_host/codeword?(/*|)) continue;; esac
echo "$lnk"
done
printf "\nIn-line filtering anywhere in path - \n\n"
for lnk in "${links[@]}"; do
case "$lnk" in
https://$exception_host/codeword) continue;;
https://$exception_host/codeword/*) continue;;
https://$exception_host/*/codeword) continue;;
https://$exception_host/*/codeword/*) continue;;
esac
echo "$lnk"
done
echo
Running it:
$: ./tst
Unfiltered list:
https://www.yahoo.com/
https://www.yahoo.com/codeword/lalala1
https://www.yahoo.com/nocodeword
https://www.bing.com/
https://www.yahoo.com/codeword
https://www.yahoo.com/codewordbar
https://www.yahoo.com/codeword/lalala2
https://www.yahoo.com/lalala2/codeword
https://www.google.com/codeword/
Pre-filtering only at start of path -
https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.bing.com/
https://www.yahoo.com/codewordbar
https://www.yahoo.com/lalala2/codeword
https://www.google.com/codeword/
Pre-filtering anywhere in path -
https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.bing.com/
https://www.yahoo.com/codewordbar
https://www.google.com/codeword/
In-line filtering at start of path only -
https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.bing.com/
https://www.yahoo.com/codewordbar
https://www.yahoo.com/lalala2/codeword
https://www.google.com/codeword/
In-line filtering anywhere in path -
https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.bing.com/
https://www.yahoo.com/codewordbar
https://www.google.com/codeword/
I'd recommend adding some flexibility of protocol and case sensitivity. Will add these if I get time later.
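One possible shape for that flexibility, offered only as a sketch: the pre-filter at the start of the path with an optional "s" in the scheme and case-insensitive matching. `cull_links` is a hypothetical name, the dots in the host are escaped by hand, and only the POSIX grep options -E, -i and -v are used.

```shell
#!/usr/bin/env bash
# cull_links is a hypothetical helper: reads links on stdin, one per line,
# and drops those whose path starts with the codeword on the given host.
cull_links() {
    local host='www\.yahoo\.com' word='codeword'
    # -E: POSIX ERE, -i: ignore case, -v: print the lines that do NOT match.
    grep -Eiv "^https?://$host/$word(/|\$)"
}
```

It would be called as `cull_links < links.txt`. Links on other hosts pass through untouched, just as in the pre-filtering variants above.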