Heuristic sorting on text match

Time:02-23

I'd like to order results by a goodness-of-match score computed over multiple concurrent text matches. That is, for each line I want to count partial matches against a collection of searches, e.g. vowels, bigrams, prefixes.

I want to use bash, awk, command line tools, or one-liners, without writing another script.

For example, say I want to sort by number of vowels in each word for

$ shuf --random-source=/usr/share/dict/words -n 10 /usr/share/dict/words | sort
Ianus
adulation
agomensin
autologist
avellaneous
granulose
lanced
minkery
outpreen
overhysterical

I want output

7 overhysterical
7 avellaneous
6 autologist
6 adulation
5 outpreen
5 granulose
5 agomensin
4 minkery
3 lanced
3 Ianus

(Please edit the question to add other phrasings of this problem.)

CodePudding user response:

For the particular question "sort by the number of vowels", GNU awk is a fine choice:

produce_words |
gawk '
  {
    vowels = gensub(/[^aeiouy]/, "", "g", tolower($0))
    count[$0] = length(vowels)
  }
  END {
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (word in count) print count[word], word
  }
'

See Using Predefined Array Scanning Orders with gawk for the PROCINFO magic.
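If you don't have GNU awk (gensub and PROCINFO["sorted_in"] are gawk extensions), a portable sketch of the same idea counts with gsub — which returns the number of substitutions it made — and hands the ordering to sort:

```shell
# Count vowels (including y) per line with POSIX awk, sort numerically descending.
printf 'lanced\nadulation\noutpreen\n' |
awk '{ w = tolower($0); print gsub(/[aeiouy]/, "", w), $0 }' |
sort -k1,1nr
# → 5 adulation
#   4 outpreen
#   2 lanced
```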

CodePudding user response:

Using perl to count the vowels:

$ cat input.txt | perl -ne 'print lc =~ tr/aeiouy//, "\t", $_' | sort -k1,1nr
6       avellaneous
6       overhysterical
5       adulation
5       autologist
4       agomensin
4       granulose
4       outpreen
3       Ianus
3       minkery
2       lanced

Replace the Useless Use Of Cat with your shuf or whatever source of lines you're using.

The lc =~ tr/aeiouy// bit replaces vowels with themselves in a lower-cased version of the input (to avoid the more cumbersome tr/aeiouyAEIOUY//) and returns the number of replacements. That's printed, plus a tab and the contents of the line. Then sort is used to, er, sort. I'm not sure how you got the vowel counts in your example, though.

CodePudding user response:

awk can do this.

A related problem is quite easy: count, for each line, how many of several patterns it matches: | awk '/c/{tot[$0]++} /t/{tot[$0]++} /v/{tot[$0]++} END{for (i in tot) print tot[i],i }' | sort -rn.

$ sed -n 200,203p /usr/share/dict/words | awk '/c/{tot[$0]++} /t/{tot[$0]++} /v/{tot[$0]++} END{for (i in tot) print tot[i],i }' | sort -rn
3 abevacuation
1 abettor
1 abettal
1 abetment

To address the whole question — counting the number of matches of each pattern in each line — is a bit more tricky. If the patterns don't include one another, in whole or in part, you can split on them; for example, sorting by vowel count:

$ ... |
  awk '{ split($0,a,"a|e|i|o|u|y"); arr[$0] += length(a)-1 }
       END { for (e in arr) print arr[e],e }' |
  sort -rn

Splitting on the matches (depending on the match order, and possibly on the order of alternatives within the pattern) breaks the input line into one field more than the number of matches. Matching on "a", for example, the input word "canada" splits into the four fields "c" "n" "d" "" — three matches, hence length(a)-1.

You can match here on multi-character patterns (put longer ones earlier in the pattern)
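For instance, scoring on a few multi-character patterns at once (the patterns er, ous, out are illustrative, not from the question):

```shell
printf 'overhysterical\nminkery\nlanced\n' |
awk '{ n = split(tolower($0), a, /er|ous|out/); arr[$0] += n - 1 }
     END { for (e in arr) print arr[e], e }' |
sort -rn
# → 2 overhysterical
#   1 minkery
#   0 lanced
```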

(Note: the array produced by split() is indexed from 1, not 0 — iterate with for (x = 1; x <= n; x++) where n = split($0,a,"a"), rather than for (x in a), whose order is undefined.)
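A quick check of the indexing — the fields from split() start at index 1, with an empty trailing field when the line ends in a match:

```shell
echo canada |
awk '{ n = split($0, a, "a"); for (x = 1; x <= n; x++) print x, "[" a[x] "]" }'
# → 1 [c]
#   2 [n]
#   3 [d]
#   4 []
```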
