Is there a way to search for a word in a file and save both said word and another word I know the be

I am looking for a way to filter a large (~12 GB) largefile.txt, which has a long string on each line, for each of the words (one per line) in queryfile.txt. Instead of saving the whole line in which a query word is found, I'd like to save only that query word and a second word, of which I only know the start (e.g. "ABC") and which I know for certain is on the same line as the query word.

For example, if queryfile.txt has the words:

this
next

And largefile.txt has the lines:

this is the first line with an ABCword  # contents of first line will be saved
and there is an ABCword2 in this one as well  # contents of 2nd line will be saved
and the next line has an ABCword2 too  # contents of this line will be saved as well
third line has an ABCword3    # contents of this line won't

(Notice that largefile.txt always has a word starting with ABC on every line. It's also impossible for a query word to start with "ABC".)

The save file should look similar to:

this ABCword
this ABCword2
next ABCword2

So far I've looked into other similar posts' suggestions, namely combining grep and awk, with commands similar to:

LC_ALL=C grep -f queryfile.txt largefile.txt | awk -F"," '$2~/ABC/' > results.txt

The problem is that not only is the query word not being saved, but -F"," '$2~/ABC/' doesn't seem to be the right way to fetch words beginning with 'ABC' either.

I also found ways to do this using only awk, but I still haven't managed to adapt the code to save word #2 as well, instead of the whole line:

awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print}' queryfile.txt largefile.txt > results.txt

CodePudding user response：

2nd attempt based on updated sample input/output in question:

$ cat tst.awk
FNR==NR { words[$1]; next }
{
    queryWord = abcWord = ""
    for (i=1; i<=NF; i++) {
        if ( $i in words ) {
            queryWord = $i
        }
        else if ( $i ~ /^ABC/ ) {
            abcWord = $i
        }
    }
    if ( (queryWord != "") && (abcWord != "") ) {
        print queryWord, abcWord
    }
}

$ awk -f tst.awk queryfile.txt largefile.txt
this ABCword
next ABCword2
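
Note that tst.awk keeps a single queryWord per line, so when a line holds several query words only the last one is printed. If one output row per query word is wanted instead, a minimal sketch along these lines might do. The scratch directory, the found array and the fourth sample line are assumptions made for illustration, not part of the question's data:

```shell
# Sample files in a scratch directory (lines from the question, comments removed).
tmp=$(mktemp -d)
printf '%s\n' this next > "$tmp/queryfile.txt"
printf '%s\n' \
    'this is the first line with an ABCword' \
    'and the next line has an ABCword2 too' \
    'third line has an ABCword3' \
    'this and next share ABCword4' > "$tmp/largefile.txt"   # hypothetical 4th line

out=$(awk '
    FNR==NR { words[$1]; next }                  # first file: remember each query word
    {
        n = 0; abcWord = ""
        for (i=1; i<=NF; i++) {
            if ($i in words)      found[++n] = $i    # collect every query word, in order
            else if ($i ~ /^ABC/) abcWord = $i       # the last ABC-prefixed word wins
        }
        if (abcWord != "")
            for (j=1; j<=n; j++) print found[j], abcWord
    }
' "$tmp/queryfile.txt" "$tmp/largefile.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With this sample data it prints:

this ABCword
next ABCword2
this ABCword4
next ABCword4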

Original answer:

This MAY be what you're trying to do (untested):

awk '
    FNR==NR { word2lgth[$1] = length($1); next }
    ($1 in word2lgth) && (match(substr($0,word2lgth[$1]+1),/ ABC[[:alnum:]_]+/) ) {
        # RSTART is relative to the substring, which starts at word2lgth[$1]+1
        print substr($0,1,word2lgth[$1]+RSTART+RLENGTH-1)
    }
' queryfile.txt largefile.txt > results.txt

CodePudding user response：

You can actually do this entirely with sed and shell manipulation of the query file:

pat=$(tr '\n' '|' <query_file | sed -E 's/\|$//')
sed -nE "s/.*(${pat}).*(ABC[a-zA-Z0-9]*).*/\1 \2/p" large_file

Prints:

this ABCword
next ABCword2
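
The substitution above only matches when the query word comes before the ABC word on the line. A hedged sketch that also tries the reverse order, reusing the same pat, could look like this. The scratch directory and the abc variable are assumptions for illustration:

```shell
# Sample files in a scratch directory (lines from the question, comments removed).
tmp=$(mktemp -d)
printf '%s\n' this next > "$tmp/query_file"
printf '%s\n' \
    'this is the first line with an ABCword' \
    'and there is an ABCword2 in this one as well' \
    'and the next line has an ABCword2 too' \
    'third line has an ABCword3' > "$tmp/large_file"

# Build the alternation this|next exactly as in the answer.
pat=$(tr '\n' '|' < "$tmp/query_file" | sed -E 's/\|$//')
abc='ABC[a-zA-Z0-9]*'

# 1st expression: query word before ABC word; "t" skips the rest once it matched.
# 2nd expression: ABC word before query word, with the groups swapped on output.
out=$(sed -nE \
    -e "s/.*(${pat}).*(${abc}).*/\1 \2/p;t" \
    -e "s/.*(${abc}).*(${pat}).*/\2 \1/p" "$tmp/large_file")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With this sample data it prints:

this ABCword
this ABCword2
next ABCword2

As in the answer's version, query words are matched as plain substrings and are interpolated into the regex unescaped, so this assumes they contain no regex metacharacters.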

CodePudding user response：

This one assumes your queryfile has more entries than there are words on a line in the largefile. Also, it does not treat your comments as comments but processes them as regular data; therefore, if cut'n'pasted, the third record is a match too.

$ awk '
NR==FNR {                              # process queryfile
    a[$0]                              # hash those query words
    next
}
{                                      # process largefile
    for(i=1;i<=NF && !(f1 && f2);i++)  # iterate until both words found
        if(!f1 && ($i in a))           # f1 holds the matching query word
            f1=$i
        else if(!f2 && ($i~/^ABC/))    # f2 holds the ABC starting word 
            f2=$i
    if(f1 && f2)                       # if both were found
        print f1,f2                    # output them 
    f1=f2=""
}' queryfile largefile
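
If the sample lines are pasted with their trailing "# ..." annotations, the word "this" inside an annotation counts as a query word, which is why the last sample line matches too. A minimal sketch of the same loop that first strips everything from "#" onward, assuming "#" never occurs in the real data (the scratch directory is likewise only for illustration):

```shell
# The four sample lines from the question, annotations included.
tmp=$(mktemp -d)
printf '%s\n' this next > "$tmp/queryfile"
cat > "$tmp/largefile" <<'EOF'
this is the first line with an ABCword  # contents of first line will be saved
and there is an ABCword2 in this one as well  # contents of 2nd line will be saved
and the next line has an ABCword2 too  # contents of this line will be saved as well
third line has an ABCword3    # contents of this line won't
EOF

out=$(awk '
NR==FNR { a[$0]; next }                  # hash the query words
{
    sub(/[ \t]*#.*/, "")                 # assumption: "#" starts an annotation, not data
    f1 = f2 = ""
    for (i=1; i<=NF && !(f1 && f2); i++) # stop as soon as both words are found
        if (!f1 && ($i in a))          f1 = $i
        else if (!f2 && ($i ~ /^ABC/)) f2 = $i
    if (f1 && f2) print f1, f2
}' "$tmp/queryfile" "$tmp/largefile")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With the annotations stripped, the last line no longer matches and the output is:

this ABCword
this ABCword2
next ABCword2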

CodePudding user response：

Using sed in a while loop

$ cat queryfile.txt
this
next


$ cat largefile.txt
this is the first line with an ABCword # contents of this line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't

$ while read -r line; do sed -n "s/.*\($line\).*\(ABC[^ ]*\).*/\1 \2/p" largefile.txt; done < queryfile.txt
this ABCword
next ABCword2