I realize there are a number of similar questions.... Hopefully I can get some more insight here.
I need to match a key from keywords.tsv against a sentence in data.tsv. If the keyword exists anywhere in the sentence, I want to print both to a new file. If there are two keywords in the same sentence, the sentence should be printed twice.
keywords.tsv
color>color
colour>color
expiry>expiration
expiration>expiration
data.tsv
something>more
What is the expiry date of your credit card?>more
The credit card colour is blue and the expiry date has passed.>more
This card has a current expiration date.>more
desired result:
expiration>What is the expiry date of your credit card?>more
expiration>The credit card colour is blue and the expiry date has passed.>more
color>The credit card colour is blue and the expiry date has passed.>more
expiration>This card has a current expiration date.>more
I've tried, among a lot of things:
awk -F "\t" 'NR==FNR{a[$1]=$2; next}
{
split($1,b,",");
for (b2 in b) { if(b[b2] == a[$1]) {print a[$1], $0}
}
}
' keywords.tsv data.tsv
I seem to be having a hard time figuring out how to access the values of the array from file1, among other issues. Help appreciated!
CodePudding user response:
I assume that > is meant to be a tab character.
Your main issue seems to be with separators: you don't want to split on commas, you want to split on sequences of whitespace:
awk '
BEGIN {FS = OFS = "\t"}
NR == FNR {kw[$1] = $2; next}
{
n = split($1, words, /[[:blank:]]+/)
for (i = 1; i <= n; i++) {
if (words[i] in kw) print kw[words[i]], $0
}
}
' keywords.tsv data.tsv
expiration What is the expiry date of your credit card? more
color The credit card colour is blue and the expiry date has passed. more
expiration The credit card colour is blue and the expiry date has passed. more
expiration This card has a current expiration date. more
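One caveat with splitting on whitespace alone: a keyword that is directly followed by punctuation (for example "expiry," or "colour.") stays glued to that punctuation and will not be found in the lookup array. A minimal sketch of one way around this, assuming it is acceptable to treat every punctuation character as a word separator, is to blank out punctuation in a copy of the sentence before splitting. The fifth data line and the output name matched.tsv are hypothetical additions for illustration; the rest of the sample data is from the question.

```shell
# Recreate the sample files from the question (tab-separated).
printf 'color\tcolor\ncolour\tcolor\nexpiry\texpiration\nexpiration\texpiration\n' > keywords.tsv
# The last sentence is a hypothetical extra line with punctuation next to keywords.
printf '%s\tmore\n' \
  'something' \
  'What is the expiry date of your credit card?' \
  'The credit card colour is blue and the expiry date has passed.' \
  'This card has a current expiration date.' \
  'Check the expiry, then the colour.' > data.tsv

awk '
BEGIN {FS = OFS = "\t"}
# First file: map each keyword to its canonical form.
NR == FNR {kw[$1] = $2; next}
{
    # Work on a copy so the original sentence is printed unchanged.
    line = $1
    # Turn punctuation into spaces so "expiry," becomes "expiry ".
    gsub(/[[:punct:]]/, " ", line)
    n = split(line, words, /[[:blank:]]+/)
    for (i = 1; i <= n; i++) {
        # Skip empty pieces left over from leading or trailing separators.
        if (words[i] != "" && (words[i] in kw)) print kw[words[i]], $0
    }
}
' keywords.tsv data.tsv > matched.tsv
```

This still matches whole words only and is case-sensitive, so "Colour" at the start of a sentence would be missed unless both the keywords and the words are passed through tolower(). As before, matches are printed in the order the keywords appear in the sentence.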