Get the most frequent phrase (not word) in a file in Bash - CodePudding

My file is

cat a.txt
a
b
aa
a
a a

I am trying to get the most frequently occurring phrases (not words).

My code is

tr -c '[:alnum:]' '[\n*]' < a.txt | sort | uniq -c | sort -nr
      4 a
      1 b
      1 aa
      1

I need

2 a
1 b
1 aa
1 a a

CodePudding user response：

sort a.txt | uniq -c | sort -rn
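
A minimal sketch of what this pipeline does on the sample input from the question, assuming a standard sort, uniq, head and sed are available. The temporary file and the helper names count_phrases and top_phrase are made up for illustration and are not part of the answer above; the sed step only strips the padded count that uniq -c puts in front of each line.

```shell
#!/usr/bin/env bash
# Recreate the sample file from the question in a temporary location.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' 'a' 'b' 'aa' 'a' 'a a' > "$sample"

# sort groups identical lines together, uniq -c prefixes each distinct
# line with its number of occurrences, sort -rn puts the highest count first.
count_phrases() {
  sort "$1" | uniq -c | sort -rn
}

# Hypothetical helper: print only the most frequent line, without its count.
top_phrase() {
  count_phrases "$1" | head -n 1 | sed -E 's/^ *[0-9]+ //'
}

count_phrases "$sample"
top_phrase "$sample"
```

Because whole lines are compared, "a a" stays a single phrase here instead of being split into two words. The relative order of phrases that share the same count depends on sort and should not be relied on.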

CodePudding user response：

When you say “in Bash”, I’m going to assume that no external programs are allowed in this exercise. (Also, what is a phrase? I’m going to assume that there is one phrase per line and that no extra preprocessing (such as whitespace trimming) is needed.)

frequent_phrases() {
  local -Ai phrases
  local -ai {dense_,}counts
  local phrase
  local -i count i
  while IFS= read -r phrase; do  # Step 0
    ((++phrases["${phrase}"]))  # count each line as one phrase
  done
  for phrase in "${!phrases[@]}"; do  # Step 1
    ((count = phrases["${phrase}"]))
    ((++counts[count]))  # how many distinct phrases share this count
    local -a "phrases_$((count))"
    local -n phrases_ref="phrases_$((count))"
    phrases_ref+=("${phrase}")  # append to the list for this count
  done
  dense_counts=("${!counts[@]}")  # Step 2
  for ((i = ${#counts[@]} - 1; i >= 0; --i)); do  # Step 3
    ((count = dense_counts[i]))
    local -n phrases_ref="phrases_$((count))"
    for phrase in "${phrases_ref[@]}"; do
      printf '%d %s\n' "$((count))" "${phrase}"
    done
  done
}

frequent_phrases < a.txt

Steps taken by the frequent_phrases function (marked in code comments):

0. Read lines (phrases) into an associative array while counting their occurrences. This yields a mapping from phrases to their counts (the phrases array).

1. Create a reverse mapping from counts back to phrases. This is a “multimap”, because several different phrases can occur the same number of times. To avoid assuming that some separator character never appears in a phrase, we store the list of phrases for each count in its own dynamically named array (instead of packing everything into a single array). For example, all phrases that occur 11 times are stored in an array called phrases_11.

   Besides the map inversion (from (phrase → count) to (count → phrases)), we also gather all known counts in an array called counts. The values of this array (how many different phrases occur a particular number of times) are not needed for this task, but its keys (the counts themselves) form a sparse set of counts that can later be iterated in sorted order.

2. We compact the sparse array counts into a dense array dense_counts for easy backward iteration. (This would be unnecessary if we were to iterate through the counts in increasing order. Iterating in reverse order is not that easy in Bash if we want to do it efficiently, without trying every possible count between the maximum and 1.)

3. We iterate through all known counts backwards (from highest to lowest) and, for each count, print all phrases that occur that number of times. Again, for example, phrases that occur 11 times are read from the array called phrases_11.
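
The two less common Bash features behind steps 1 to 3, dynamically named arrays reached through a nameref and the keys of a sparse indexed array, can be tried in isolation. This is a minimal sketch, assuming Bash 4.3 or newer (needed for local -n); the function name group_by_count, the bucket_ prefix and the hard-coded sample pairs are hypothetical and not part of the code above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: takes "count phrase" pairs as arguments and prints
# them grouped by count, highest count first.
group_by_count() {
  local -ai counts dense_counts
  local -i count i
  local phrase
  while (($# >= 2)); do
    count=$1
    phrase=$2
    shift 2
    ((++counts[count]))                    # key: count, value: number of phrases
    local -a "bucket_${count}"             # one array per count, e.g. bucket_11
    local -n bucket_ref="bucket_${count}"  # re-point the nameref at that array
    bucket_ref+=("${phrase}")
  done
  # Keys of a sparse indexed array expand in ascending order, so this
  # yields a dense, sorted list of the counts that actually occur.
  dense_counts=("${!counts[@]}")
  for ((i = ${#dense_counts[@]} - 1; i >= 0; --i)); do
    count=${dense_counts[i]}
    local -n bucket_ref="bucket_${count}"
    for phrase in "${bucket_ref[@]}"; do
      printf '%d %s\n' "${count}" "${phrase}"
    done
  done
}

group_by_count 1 b 2 a 1 aa 1 'a a'
```

Since each phrase is kept as its own array element, a phrase containing spaces such as "a a" survives intact, which is the reason for using per-count arrays rather than one delimiter-joined string per count.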

Just for completeness, to print out (also) the extra bits of statistics we gathered, one could extend the printf command like this:

      printf 'count: %d, phrases with this count: %d, phrase: "%s"\n' \
             "$((count))" "$((counts[count]))" "${phrase}"