Extract words within curly quotes but keep the quotes used as apostrophes

I have a UTF-8 file in which curly quotes wrap some words, as in ‘Awaara’, and in other places a curly quote serves as an apostrophe, as in don’t. The problem arises when I convert the curly quotes to straight single quotes: after the conversion, I cannot strip the quotes around words such as 'Awaara' without also removing the apostrophes in words such as don't and I'm.

GOAL: Convert curly quotes to straight single quotes, then remove the quotes that wrap words while keeping the ones used as apostrophes.

Here is the code I have written; it converts the quotes but fails to remove the ones that wrap words:

#!/bin/bash



cat $1 | sed -e "s/\’/'/g" -e  "s/\‘/'/g" | sed -e "s/^'/ /g" -e "s/'$/ /g" | sed "s/\…/ /g" | tr '>' ' ' | tr '?' ' ' | tr ',' ' ' | tr ';' ' ' | tr '.' ' ' | tr '!' ' ' | tr '′' ' ' | tr ':' ' ' | sed -e "s/\[/ /g" -e "s/\]/ /g" -e 's/(/ /g' -e "s/)/ /g" | tr ' ' '\n' | sort -u | uniq | tr 'a-z' 'A-Z' >our_vocab.txt

The output is:


'AWAARA ---> Should be AWAARA
25
50
70
800
A
AD
AI
AMITABH
AND
ANYWAY
ARE
BACHCHAN
BECAUSE
BUT
C  
CAN
CHECK
COMPUTER
DEVAKI
DIFFICULT
.
.
. 
HOON'   --> Should be HOON

CodePudding user response：

You can use

sed -E -e "s/([[:alpha:]]['’][[:alpha:]])|['‘’]/\\1/g" \
  -e 's/[][()>?,;.!:]|′|…/ /g' "$1" | tr ' ' '\n' | sort -u | \
  tr 'a-z' 'A-Z' > our_vocab.txt

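I merged the separate tr commands into a single (second) sed expression. In the first expression, ([[:alpha:]]['’][[:alpha:]]) captures an apostrophe (straight or curly) that sits between two letters, and \1 writes it back unchanged. Any other ', ‘ or ’ matches the second alternative, ['‘’], where group 1 is empty, so that quote is deleted.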
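
To see the first expression in isolation, here is a minimal sketch wrapped in a helper function. The name strip_quotes and the sample line are made up for illustration. It also spells the quote characters as alternatives rather than inside a bracket expression; that is an assumption on my part, so that the multi-byte curly quotes still match as whole characters if sed happens to run in a non-UTF-8 locale. The capture-and-restore idea is the same as in the command above.

```bash
#!/bin/bash

# strip_quotes: hypothetical helper, reads text on stdin.
# Group 1 captures letter + apostrophe + letter, and \1 writes it back unchanged.
# Any other ', ‘ or ’ matches a later alternative, where group 1 is empty,
# so that quote is deleted.
strip_quotes() {
  sed -E "s/([[:alpha:]]('|’)[[:alpha:]])|'|‘|’/\\1/g"
}

# Sample line (made up): quoted words plus straight and curly apostrophes.
printf '%s\n' "‘Awaara’ don’t I'm 'hoon'" | strip_quotes
# prints: Awaara don’t I'm hoon
```

Note that the curly apostrophe in don’t comes back exactly as it was matched. If straight apostrophes are wanted throughout, one option is to add a further expression such as -e "s/’/'/g" after this one.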