Awk negative lookahead workaround

I have an issue with a regular expression in awk. I have an array of links, and I need to get all links that start with www.yahoo.com except those where the host is followed by codeword.

www.yahoo.com/nocodeword/... is OK; www.yahoo.com/codeword/... is not.

What is a workaround for negative lookahead in POSIX Extended Regular Expressions (ERE)? As I understand it, awk, sed, and grep have no native support for it. I came up with a solution using nested if statements. Is there a better approach? I can't use GNU tools or grep -P.

Here is my solution:

links=("https://www.yahoo.com/" "https://www.yahoo.com/codeword/lalala1" "https://www.yahoo.com/nocodeword" "https://www.bing.com/" "https://www.yahoo.com/codeword" "https://www.yahoo.com/codeword/lalala2" "https://www.google.com/" "https://www.yahoo.com/foo/codeword" "https://www.yahoo.com/codewordbar")

exception_link="www.yahoo.com"
exception_word="codeword"

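for link in "${links[@]}"; do
    # field 3 of a "/"-split URL is the host, field 4 the first path segment
    processed_link=$(awk -F/ '{print $3}' <<<"$link")
    if [[ "$processed_link" == "$exception_link" ]]; then
        processed_link2=$(awk -F/ '{print $4}' <<<"$link")
        # quote the right-hand side so it is compared literally, not as a glob
        if [[ "$processed_link2" != "$exception_word" ]]; then
            echo "${link}"
        fi
    fi
done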

Expected output:

https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.yahoo.com/codewordbar

Would be grateful for the help.

CodePudding user response：

Lookahead, lookbehind, etc. are all just syntactic sugar in the context of Unix tools. You don't need them, and you don't need to find an alternative to them if they're not something you think about in the first place.
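To make that concrete: a negative lookahead can usually be written in POSIX awk as one regex that must match combined with another that must not. The following is only a minimal sketch, assuming the links arrive one per line on stdin and that codeword should be rejected as a whole path segment anywhere in the path; filter_links is a hypothetical helper name, not something from the question.

```shell
# Hypothetical helper: reads one link per line on stdin, prints the links to keep.
filter_links() {
    # Test 1 (must match): any scheme, host exactly www.yahoo.com, then "/" or end of line.
    # Test 2 (must not match): "codeword" as a complete path segment.
    awk '/^[^:]+:\/\/www\.yahoo\.com(\/|$)/ && !/\/codeword(\/|$)/'
}

out=$(printf '%s\n' \
    "https://www.yahoo.com/" \
    "https://www.yahoo.com/codeword/lalala1" \
    "https://www.yahoo.com/nocodeword" \
    "https://www.bing.com/" \
    "https://www.yahoo.com/codeword" \
    "https://www.yahoo.com/foo/codeword" \
    "https://www.yahoo.com/codewordbar" | filter_links)
printf '%s\n' "$out"
```

If only the first path segment should be checked, the second test could be anchored to the host instead, e.g. `!/^[^:]+:\/\/www\.yahoo\.com\/codeword(\/|$)/`.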

Assuming you have some reason to first create an array of links in shell, I'd just do the rest in 1 call to awk:

$ cat tst.sh
#!/usr/bin/env bash

links=("https://www.yahoo.com/" "https://www.yahoo.com/codeword/lalala1" "https://www.yahoo.com/nocodeword" "https://www.bing.com/" "https://www.yahoo.com/codeword" "https://www.yahoo.com/codeword/lalala2" "https://www.google.com/" "https://www.yahoo.com/foo/codeword" "https://www.yahoo.com/codewordbar")

awk -v links_="$(printf '%s\n' "${links[@]}")" '
    BEGIN {
        exception_link="www.yahoo.com"
        exception_word="codeword"

        n = split(links_,links,RS)
        for ( i=1; i<=n; i++ ) {    # "for (i in links)" visits elements in unspecified order
            schemeless_link = links[i]
            sub("^[^:]+://","",schemeless_link)
            if ( (index(schemeless_link"/",exception_link"/") == 1) &&
                !(index(schemeless_link"/",exception_link"/"exception_word"/") == 1) ) {
                print links[i]
            }
        }
    }
'

$ ./tst.sh
https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.yahoo.com/foo/codeword
https://www.yahoo.com/codewordbar

or, if foo/codeword should not be present in the output, then:

$ cat tst.sh
#!/usr/bin/env bash

links=("https://www.yahoo.com/" "https://www.yahoo.com/codeword/lalala1" "https://www.yahoo.com/nocodeword" "https://www.bing.com/" "https://www.yahoo.com/codeword" "https://www.yahoo.com/codeword/lalala2" "https://www.google.com/" "https://www.yahoo.com/foo/codeword" "https://www.yahoo.com/codewordbar")

awk -v links_="$(printf '%s\n' "${links[@]}")" '
    BEGIN {
        exception_link="www.yahoo.com"
        exception_word="codeword"

        n = split(links_,links,RS)
        for ( i=1; i<=n; i++ ) {    # "for (i in links)" visits elements in unspecified order
            schemeless_link = links[i]
            sub("^[^:]+://","",schemeless_link)
            if ( (index(schemeless_link"/",exception_link"/") == 1) &&
                !index(schemeless_link"/","/"exception_word"/") ) {
                print links[i]
            }
        }
    }
'

$ ./tst.sh
https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.yahoo.com/codewordbar

The above assumes that none of your links can contain literal newline characters and that you want to test for yahoo.com whether the link starts with https or http or any other scheme.
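One more caveat worth spelling out: awk expands backslash escape sequences in values assigned with -v, so a link containing a backslash would be altered on the way in. A sketch of the same test reading the links from stdin instead, which sidesteps that; the function name filter_yahoo and the variable names host and word are illustrative assumptions:

```shell
# Hypothetical helper: same index()-based test, but links come from stdin, one per line.
filter_yahoo() {
    awk -v host="www.yahoo.com" -v word="codeword" '
        {
            link = $0
            sub(/^[^:]+:\/\//, "", link)   # strip any scheme (http, https, ftp, ...)
            link = link "/"                # so a trailing segment is "/"-terminated too
            if ( index(link, host "/") == 1 && !index(link, "/" word "/") )
                print                      # prints the original, unmodified link
        }
    '
}

out=$(printf '%s\n' \
    "https://www.yahoo.com/" \
    "http://www.yahoo.com/nocodeword" \
    "ftp://www.yahoo.com/codeword/lalala1" \
    "https://www.bing.com/" \
    "https://www.yahoo.com/foo/codeword" | filter_yahoo)
printf '%s\n' "$out"
```

With a bash array, this could be fed as `printf '%s\n' "${links[@]}" | filter_yahoo`.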

CodePudding user response：

It seems using pattern matching, instead of regular expressions, with the [[...]] construct of bash is sufficient and much simpler for this problem:

for lnk in "${links[@]}"; do
    if [[ $lnk == https://www.yahoo.com* && $lnk/ != */codeword/* ]]; then
        echo "$lnk"
    fi
done

prints out

https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.yahoo.com/codewordbar
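Note that the glob https://www.yahoo.com* also accepts hosts that merely begin with that string, such as www.yahoo.com.example.org. If that matters, the host can be closed off with a slash. Here is a sketch of the same two glob tests written with case, so it also runs in plain POSIX sh; keep_link is a hypothetical name and the example.org link is made up for illustration:

```shell
# Hypothetical helper: succeeds (status 0) if the link should be kept.
keep_link() {
    # Append "/" so a link without a trailing slash is treated the same way.
    case "$1/" in
        https://www.yahoo.com/*) ;;        # host must end exactly here
        *) return 1 ;;
    esac
    case "$1/" in
        */codeword/*) return 1 ;;          # codeword as a whole path segment
    esac
    return 0
}

out=$(
    for lnk in \
        "https://www.yahoo.com/" \
        "https://www.yahoo.com/codeword/lalala1" \
        "https://www.yahoo.com/nocodeword" \
        "https://www.yahoo.com.example.org/" \
        "https://www.yahoo.com/foo/codeword" \
        "https://www.yahoo.com/codewordbar"
    do
        if keep_link "$lnk"; then printf '%s\n' "$lnk"; fi
    done
)
printf '%s\n' "$out"
```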

CodePudding user response：

If you are just skipping the records with codeword as the first part of the path, you might do better to simply filter them out of your input.

Otherwise, I think you'd be better served either using bash built-in string processing, or converting the whole logic to awk (or perl, or python, etc...)

This code should demonstrate those variations...

$: cat tst
#!/bin/bash

exception_host="www.yahoo.com"
exception_word="codeword"

printf "\nUnfiltered list: \n\n"

cat links.txt

printf "\nPre-filtering only at start of path - \n\n"

mapfile -t culled < <( grep -Ev "https://$exception_host/codeword(/|\$)" links.txt )
printf "%s\n" "${culled[@]}"

printf "\nPre-filtering anywhere in path - \n\n"

mapfile -t culled < <( grep -Ev "https://$exception_host(/|/.+/)codeword(/|\$)" links.txt )
printf "%s\n" "${culled[@]}"

printf "\nIn-line filtering at start of path only - \n\n"

mapfile -t links < links.txt

shopt -s extglob
for lnk in "${links[@]}"; do
    case "$lnk" in https://$exception_host/codeword?(/*|)) continue;; esac
    echo "$lnk"
done

printf "\nIn-line filtering anywhere in path - \n\n"

for lnk in "${links[@]}"; do
    case "$lnk" in
         https://$exception_host/codeword)     continue;;
         https://$exception_host/codeword/*)   continue;;
         https://$exception_host/*/codeword)   continue;;
         https://$exception_host/*/codeword/*) continue;;
    esac
    echo "$lnk"
done

echo

Running it:

$: ./tst

Unfiltered list:

https://www.yahoo.com/
https://www.yahoo.com/codeword/lalala1
https://www.yahoo.com/nocodeword
https://www.bing.com/
https://www.yahoo.com/codeword
https://www.yahoo.com/codewordbar
https://www.yahoo.com/codeword/lalala2
https://www.yahoo.com/lalala2/codeword
https://www.google.com/codeword/

Pre-filtering only at start of path -

https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.bing.com/
https://www.yahoo.com/codewordbar
https://www.yahoo.com/lalala2/codeword
https://www.google.com/codeword/

Pre-filtering anywhere in path -

https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.bing.com/
https://www.yahoo.com/codewordbar
https://www.google.com/codeword/

In-line filtering at start of path only -

https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.bing.com/
https://www.yahoo.com/codewordbar
https://www.yahoo.com/lalala2/codeword
https://www.google.com/codeword/

In-line filtering anywhere in path -

https://www.yahoo.com/
https://www.yahoo.com/nocodeword
https://www.bing.com/
https://www.yahoo.com/codewordbar
https://www.google.com/codeword/

I'd recommend adding some flexibility of protocol and case sensitivity. Will add these if I get time later.
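In the meantime, here is a rough sketch of what that flexibility might look like for the pre-filtering variant, assuming http and https are the only schemes of interest and that both host and codeword should match regardless of case. It uses only the POSIX grep options -E, -i and -v; cull is a hypothetical function name.

```shell
# Hypothetical helper: drops yahoo links with codeword as a path segment,
# for http or https, ignoring case; every other line passes through.
cull() {
    grep -Eiv '^https?://www\.yahoo\.com(/|/.+/)codeword(/|$)'
}

out=$(printf '%s\n' \
    "https://www.yahoo.com/" \
    "HTTP://WWW.YAHOO.COM/CodeWord/lalala1" \
    "http://www.yahoo.com/lalala2/codeword" \
    "https://www.yahoo.com/codewordbar" \
    "https://www.google.com/codeword/" | cull)
printf '%s\n' "$out"
```

Like the grep calls in the script, this only removes the unwanted yahoo links; links for other hosts are left in the list.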