I have a UTF-8 file in which curly quotes are used both as quotation marks, as in ‘Awaara’, and as apostrophes, as in don’t (straight apostrophes such as don't also occur). The problem arises when I convert the curly quotes to straight single quotes: after the conversion, I cannot strip the quotes around words such as 'Awaara' without also removing the apostrophes in words such as don't and I'm.
GOAL: Convert curly quotes to straight single quotes, then remove the single quotes that surround words while keeping the apostrophes inside words.
Here's the code I have written; it converts the quotes but fails to remove the ones around quoted words:
#!/bin/bash
cat $1 | sed -e "s/\’/'/g" -e "s/\‘/'/g" | sed -e "s/^'/ /g" -e "s/'$/ /g" | sed "s/\…/ /g" | tr '>' ' ' | tr '?' ' ' | tr ',' ' ' | tr ';' ' ' | tr '.' ' ' | tr '!' ' ' | tr '′' ' ' | tr ':' ' ' | sed -e "s/\[/ /g" -e "s/\]/ /g" -e 's/(/ /g' -e "s/)/ /g" | tr ' ' '\n' | sort -u | uniq | tr 'a-z' 'A-Z' >our_vocab.txt
The output is:
'AWAARA ---> Should be AWAARA
25
50
70
800
A
AD
AI
AMITABH
AND
ANYWAY
ARE
BACHCHAN
BECAUSE
BUT
C
CAN
CHECK
COMPUTER
DEVAKI
DIFFICULT
.
.
.
HOON' --> Should be HOON
CodePudding user response:
You can use
sed -E -e "s/([[:alpha:]]['’][[:alpha:]])|['‘’]/\\1/g" \
-e 's/[][()>?,;.!:]|′|…/ /g' "$1" | tr ' ' '\n' | sort -u | \
tr 'a-z' 'A-Z' > our_vocab.txt
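I merged the several tr commands into a single (second) sed expression. In the first expression, the regex ([[:alpha:]]['’][[:alpha:]])|['‘’] either captures an apostrophe (straight or curly) together with the letter on each side of it, or matches any other ', ‘ or ’ on its own. Replacing each match with \1 puts the captured text back unchanged, so every quote that is not between two letters is deleted.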
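If you also want the curly apostrophes inside words turned into straight ones (the command above keeps don’t as it is), one possible variant is to normalize first and strip afterwards. This is only a sketch: the function names strip_quotes and vocab are made up for illustration, and it uses alternation instead of a bracket expression for the curly quotes so that it does not depend on sed running in a UTF-8 locale.

```shell
#!/bin/bash

# Hypothetical helper: reads text on stdin and writes it back with
# the quotes cleaned up.
strip_quotes() {
  # 1st expression: turn both curly quotes into a straight single quote.
  # 2nd expression: keep an apostrophe that sits between two letters
  #                 (captured as \1), delete every other single quote.
  sed -E \
    -e "s/‘|’/'/g" \
    -e "s/([[:alpha:]]'[[:alpha:]])|'/\1/g"
}

# Hypothetical helper: one uppercase word per line, duplicates removed.
# Uppercasing happens before sort -u so that "Don't" and "don't"
# collapse into a single entry.
vocab() {
  strip_quotes | tr -s ' ' '\n' | tr 'a-z' 'A-Z' | sort -u
}

printf '%s\n' "‘Awaara’ don’t 'hoon' I'm" | strip_quotes
# prints: Awaara don't hoon I'm
```

To build the word list you would still add the punctuation-to-space expression from the command above between strip_quotes and tr. Note that the order of tr 'a-z' 'A-Z' and sort -u matters: when the uppercasing comes last, words that differ only in case show up twice in the output.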