I am looking for a way to filter a large (~12 GB) file, largefile.txt, which has a long string on each line, for each of the words (one per line) in queryfile.txt. However, instead of saving the whole line in which each query word is found, I'd like to save only that query word and a second word, of which I only know the start (e.g. "ABC") and which I know for certain is on the same line as the first word.
For example, if queryfile.txt has the words:
this
next
And largefile.txt has the lines:
this is the first line with an ABCword # contents of first line will be saved
and there is an ABCword2 in this one as well # contents of 2nd line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
(Notice that every line of largefile.txt always contains a word starting with ABC. It is also impossible for a query word to start with "ABC".)
The save file should look similar to:
this ABCword
this ABCword2
next ABCword2
So far I've looked into other similar posts' suggestions, namely combining grep and awk, with commands similar to:
LC_ALL=C grep -f queryfile.txt largefile.txt | awk -F"," '$2~/ABC/' > results.txt
The problem is that not only is the query word not saved, but the -F"," '$2~/ABC/' part doesn't seem to be the right way to fetch words beginning with 'ABC' either.
I also found ways of using only awk, but I still haven't managed to adapt the code to save the second word as well, instead of the whole line:
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print}' queryfile.txt largefile.txt > results.txt
CodePudding user response:
2nd attempt based on updated sample input/output in question:
$ cat tst.awk
FNR==NR { words[$1]; next }
{
    queryWord = abcWord = ""
    for (i=1; i<=NF; i++) {
        if ( $i in words ) {
            queryWord = $i
        }
        else if ( $i ~ /^ABC/ ) {
            abcWord = $i
        }
    }
    if ( (queryWord != "") && (abcWord != "") ) {
        print queryWord, abcWord
    }
}
$ awk -f tst.awk queryfile.txt largefile.txt
this ABCword
next ABCword2
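To sanity-check the script against the four sample lines from the question (with the "# ..." annotations removed), a minimal sketch along these lines may help. The temporary directory and the variable name result are illustrative assumptions, and the awk program is the tst.awk logic inlined in condensed form.

```shell
#!/bin/sh
# Sketch: run the tst.awk logic on the sample data from the question.
# The temp directory and the variable name "result" are illustrative only.
tmp=$(mktemp -d)

printf '%s\n' this next > "$tmp/queryfile.txt"
printf '%s\n' \
  'this is the first line with an ABCword' \
  'and there is an ABCword2 in this one as well' \
  'and the next line has an ABCword2 too' \
  'third line has an ABCword3' > "$tmp/largefile.txt"

result=$(awk '
  FNR==NR { words[$1]; next }                 # first file: remember the query words
  {
    queryWord = abcWord = ""
    for (i = 1; i <= NF; i++) {
      if ($i in words)      queryWord = $i    # the last query word on the line wins
      else if ($i ~ /^ABC/) abcWord   = $i    # the last ABC-prefixed word wins
    }
    if (queryWord != "" && abcWord != "") print queryWord, abcWord
  }
' "$tmp/queryfile.txt" "$tmp/largefile.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

On that input the second sample line also matches, because every field is checked regardless of whether the query word comes before or after the ABC word, so the output has three lines: "this ABCword", "this ABCword2" and "next ABCword2".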
Original answer:
This MAY be what you're trying to do (untested):
awk '
    FNR==NR { word2lgth[$1] = length($1); next }
    ($1 in word2lgth) && (match(substr($0,word2lgth[$1]+1),/ ABC[[:alnum:]_]+/) ) {
        # RSTART is relative to the substring that starts at word2lgth[$1]+1
        print substr($0,1,word2lgth[$1]+RSTART+RLENGTH-1)
    }
' queryfile.txt largefile.txt > results.txt
CodePudding user response:
You can actually do this entirely with sed and shell manipulation of the query file:
pat=$(tr '\n' '|' <query_file | sed -E 's/\|$//')
sed -nE "s/.*(${pat}).*(ABC[a-zA-Z0-9]*).*/\1 \2/p" large_file
Prints:
this ABCword
next ABCword2
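One thing that may be worth checking with this approach is word order: the regex needs the query word to appear before the ABC word on the line. The sketch below runs the same substitution on the four sample lines from the question (annotations removed). It assumes the query words contain no regex metacharacters, uses the question's file names inside a temporary directory, and swaps in paste to build the alternation instead of tr plus sed; the variable name result is illustrative.

```shell
#!/bin/sh
# Sketch: show that the sed pattern only matches when the query word
# precedes the ABC word. Assumes query words have no regex metacharacters.
tmp=$(mktemp -d)

printf '%s\n' this next > "$tmp/queryfile.txt"
printf '%s\n' \
  'this is the first line with an ABCword' \
  'and there is an ABCword2 in this one as well' \
  'and the next line has an ABCword2 too' \
  'third line has an ABCword3' > "$tmp/largefile.txt"

# Join the query words with "|" to form an ERE alternation: this|next
pat=$(paste -s -d '|' "$tmp/queryfile.txt")

# Keep only "<query word> <ABC word>" for lines where the pattern matches
result=$(sed -nE "s/.*(${pat}).*(ABC[a-zA-Z0-9]*).*/\1 \2/p" "$tmp/largefile.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

On this input the second sample line is not printed, because "this" comes after ABCword2 there; only "this ABCword" and "next ABCword2" appear. Whether that matters depends on how the real data is laid out.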
CodePudding user response:
This one assumes your queryfile has more entries than there are words on a line in the largefile. Also, it does not treat your comments as comments but processes them as regular data, so if the sample is cut and pasted as is, the third record is a match too.
$ awk '
NR==FNR {                              # process queryfile
    a[$0]                              # hash those query words
    next
}
{                                      # process largefile
    for(i=1;i<=NF && !(f1 && f2);i++)  # iterate until both words found
        if(!f1 && ($i in a))           # f1 holds the matching query word
            f1=$i
        else if(!f2 && ($i~/^ABC/))    # f2 holds the ABC starting word
            f2=$i
    if(f1 && f2)                       # if both were found
        print f1,f2                    # output them
    f1=f2=""
}' queryfile largefile
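The remark about comments being processed as data can be seen with a small experiment. The sketch below is only an illustration: it feeds a three-line sample with the "# ..." annotations left in to the same logic (braces added for readability), using a temporary directory and a variable named result, both of which are assumptions of this example.

```shell
#!/bin/sh
# Sketch: the "# ..." annotations are ordinary fields to awk, so the word
# "this" inside the third line's annotation counts as a query-word hit.
tmp=$(mktemp -d)

printf '%s\n' this next > "$tmp/queryfile"
printf '%s\n' \
  'this is the first line with an ABCword # contents of this line will be saved' \
  'and the next line has an ABCword2 too # contents of this line will be saved as well' \
  "third line has an ABCword3 # contents of this line won't" > "$tmp/largefile"

result=$(awk '
  NR==FNR { a[$0]; next }                        # hash the query words
  {
    for (i = 1; i <= NF && !(f1 && f2); i++) {   # stop once both are found
      if (!f1 && ($i in a))           f1 = $i    # first query word on the line
      else if (!f2 && ($i ~ /^ABC/))  f2 = $i    # first ABC-prefixed word
    }
    if (f1 && f2) print f1, f2
    f1 = f2 = ""
  }
' "$tmp/queryfile" "$tmp/largefile")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The third record yields "this ABCword3" because "this" appears in its trailing annotation; with the annotations stripped from the data, that line would produce no output.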
CodePudding user response:
Using sed in a while loop:
$ cat queryfile.txt
this
next
$ cat largefile.txt
this is the first line with an ABCword # contents of this line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
$ while read -r line; do sed -n "s/.*\($line\).*\(ABC[^ ]*\).*/\1 \2/p" largefile.txt; done < queryfile.txt
this ABCword
next ABCword2