Extract all the words from a text file in bash

Time:11-16

I need to read all the words from a file into a variable, storing each word only once. The selection is not case sensitive, so "Hello", "hello", "hElLo" and "HELLO" count as the same word. If a word has an apostrophe, like "it's", the "'s" must be ignored and only "it" counted as a word.

To do that I used the following command:

# Store the words of the file without duplicates
WORDS=`grep -o -E '\w+' "$1" | sort -u -f`

The first two criteria are met but this method counts words like "it's" as two separate words "it" and "s".
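
A quick way to reproduce the problem is to feed grep a made-up one-line input (the sample text and the `tokens` variable are only for illustration). Since \w does not match an apostrophe, the token is cut at that character:

```shell
# Hypothetical sample input, not taken from the real file
tokens=$(printf '%s\n' "It's fine" | grep -o -E '\w+')
echo "$tokens"
# It
# s
# fine
```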

CodePudding user response:

Maybe something like this:

WORDS=$(grep -o -E "(\w|')+" words.txt | sed -e "s/'.*\$//" | sort -u -f)

UPDATE

Explanations:

  • var=$(...command...) : Executes the command (the newer and preferred alternative to `...command...`) and stores its standard output in the variable var
  • grep -o -E "(\w|')+" words.txt : Reads the file words.txt and applies the grep filter
    • The grep filter prints only the matched tokens (-o) of the extended (-E) regular expression (\w|')+. This expression matches word characters (\w is a synonym for [_[:alnum:]]; alnum means alphanumeric characters such as [0-9a-zA-Z] in English, extended to many other characters in other languages) or (|) a single quote ('), one or more times (+) : see man grep
  • The standard output of grep becomes the standard input of the next command, sed, through the pipe (|)
  • sed -e "s/'.*\$//" : Executes (-e) the expression s/'.*\$// :
    • The sed expression substitutes (s/) '.*\$ (a single quote followed by zero or more characters up to the end of the line) with the empty string (between the last two slashes (//)) : see man sed
  • The standard output of sed becomes the standard input of the next command, sort, through the pipe (|)
  • sort -u -f : Sorts the output of sed, removes duplicates (-u : unique) and ignores the difference between upper and lower case (-f) : see man sort
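
To see the three stages working together, here is a minimal sketch that runs the pipeline on a throwaway file. The temporary file and its contents are assumptions made up for this demonstration, and the \w shorthand is assumed to be supported by your grep (it is in GNU grep):

```shell
#!/usr/bin/env bash
# Hypothetical sample file, created only for this demonstration
sample=$(mktemp)
printf '%s\n' "Hello hello HELLO" "It's what it's" > "$sample"

# 1. grep: one token per line, keeping apostrophes inside the token
# 2. sed:  drop everything from the first apostrophe to the end of the line
# 3. sort: order the words and remove duplicates, ignoring case
WORDS=$(grep -o -E "(\w|')+" "$sample" | sed -e "s/'.*\$//" | sort -u -f)

# Prints three lines: one spelling each of "hello", "it" and "what"
echo "$WORDS"

rm -f "$sample"
```

Which spelling of a word survives (for example "Hello" or "HELLO") is up to sort, so lowercase the result with tr if you need a fixed form. Note also that a token starting with an apostrophe is reduced to an empty line by the sed step.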