How can I detect double letters in AWK? (telly -> tely in AWK) backreferences in the match term?

I'm trying to turn telly into tely.

I've tried

awk 'BEGIN {f="telly" ;print gensub(/(.)\\1/,"\\1","g",f)}'

and

awk 'BEGIN {f="telly" ;print gensub(/(.)\1/,"\\1","g",f)}'

but I still get telly.

I'm pretty sure I can do this (use backreferences in the match expression) in sed, and probably in perl too. But I'm writing functions in awk because it makes processing multi-column data simpler than hacking out the columns in sed.

For example, I am running different processes on a lexicon I'm working with.

Here is an example of some failed output. The third column for connoisseur should not have a double s or n.

otto ottô ottô o-tt--ô ottô 11025
hindu hindü hindö hind--ü hndü 11250
wearily weárílý weérélê weáríl--ý wrlý 11251
nora nørá nøré nør--á nrá 11252
formulate før#mûlâtè fømûlât før#mûlât--è fr#mltè 11253
embryo embrýô embrêô e-mbr--ýô embrýô 11254
stylish stŷliŝħ stîliŝ stŷliŝ--ħ stlŝħ 11255
eruption ėrupţìòn irupŝn ė-rupţìòn ėrpţn 11256
authoritarian auπħorítã#rïán auπoréte#rêén au-πħorítã#rïán auπrt#rn 11258
untouched untóùĉħèð untéĉt u-ntóùĉħèð untĉð 11425
penry penrý penrê penr--ý pnrý 11625
maze mâzè mâz mâz--è mzè 11725
forge før#ĝè føj før#ĝ--è fr#ĝè 11825
ferrari fèŕrārï fŕrārê fèŕrār--ï frrï 12511
assailant ássâìlánt éssâlént á-ssâìlánt ásslnt 25011
corrosive còŕr0ôsivè cŕôsiv còŕr0ôsiv--è cr0svè 25111
daimler dâìmlèŕ dâmlŕ dâìml--èŕ dmlèŕ 25311
connoisseur connoíssèùŕ connoéssŕ connoíss--èùŕ cnnssèùŕ 25511
airframe ãìŕfrâmè eŕfrâm ãìŕ-frâm--è ãìŕfrmè 25911
ampersand ampèŕsand ampŕsand a-mpèŕsand ampsnd 62511

The input is three or four columns per line and I want to process it field by field rather than line by line, hence the use of awk.

Just for info, here is a tiny snippet of the input:

,"accepted","acçeptėd","1118"
,"ellis","ellis","7111"
,"woollen","wōòllén","11111"
,"hurricane","hurrícânè","11113"
,"fuelled","fûéllèd","11114"
,"groom","gröòm","11132"
,"preferring","prėfèŕriñg0","11134"
,"uttered","uttèŕèd","11138"
,"surrendered","sùŕr0endèŕèd","11141"
,"differentiate","différenţïâtè","11145"
,"exceeding","ėxc0êèdiñg0","11146"
,"groove","gröòvè","11148"
,"floppy","floppý","11163"
,"butterflies","buttèŕflîèś","11165"
,"ee","êè","11167"
,"cartoon","cār#töòn","11170"
,"slapped","slappèð","11172"
,"scattering","scattériñg0","11178"
,"jubilee","jübílêè","11179"
,"buzzing","buzziñg0","16111"
,"whipping","wħippiñg0","19111"
,"missus","missμś","21111"
,"corrosive","còŕr0ôsivè","25111"
,"alluring","állūriñg0","31110"
,"confidentially","confídenţìállý","34111"
,"antenna","antenná","35111"
,"whoosh","wħöòŝħ","41114"
,"fattened","fatténèd","49111"
,"cobble","cobblè","61116"

Here are the final lines of the awk file I'm using. It applies functions directly to the fields, which is why I am using awk. The third column in the output comes from a disambiguate function. I had put a gensub in that function to try to 'singl-ify' the double letters.

some code with functions in . . .

BEGIN {FS= "\"" }
{print $2,$4,disambiguate($4),isolate_terminal_vowels($4),devowelCentre(isolate_terminal_vowels($4)),$6}

thx

CodePudding user response：

Awk does not support backreferences in a regexp (gawk's gensub() only supports them in the replacement text). Here's how you'd do what you want in awk (updated based on new input and additional information provided):

$ cat tst.awk
function compress(oldStr,       newStr,lgth,charPos,char,seen,regexp,string) {
    newStr = oldStr
    lgth = length(oldStr)
    for (charPos=1; charPos<lgth; charPos++) {
        char = substr(oldStr,charPos,1)
        # for letters only: if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {
        if ( !seen[char]++ ) {
            regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
            string = ( char == "&" ? "\\" : "" ) char
            gsub(regexp,string,newStr)
        }
    }
    return newStr
}

BEGIN { FS=OFS="\"" }
{
    for (i=2; i<NF; i+=2) {
        $i = compress($i)
    }
    print
}

$ awk -f tst.awk file
,"acepted","acçeptėd","18"
,"elis","elis","71"
,"wolen","wōòlén","1"
,"huricane","hurícânè","13"
,"fueled","fûélèd","14"
,"grom","gröòm","132"
,"prefering","prėfèŕriñg0","134"
,"utered","utèŕèd","138"
,"surendered","sùŕr0endèŕèd","141"
,"diferentiate","diférenţïâtè","145"
,"exceding","ėxc0êèdiñg0","146"
,"grove","gröòvè","148"
,"flopy","flopý","163"
,"buterflies","butèŕflîèś","165"
,"e","êè","167"
,"carton","cār#töòn","170"
,"slaped","slapèð","172"
,"scatering","scatériñg0","178"
,"jubile","jübílêè","179"
,"buzing","buziñg0","161"
,"whiping","wħipiñg0","191"
,"misus","misμś","21"
,"corosive","còŕr0ôsivè","251"
,"aluring","álūriñg0","310"
,"confidentialy","confídenţìálý","341"
,"antena","antená","351"
,"whosh","wħöòŝħ","414"
,"fatened","faténèd","491"
,"coble","coblè","616"

Original answer:

$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        fld = $i
        lgth = length($i)
        delete seen
        for (j=1; j<lgth; j++) {
            char = substr($i,j,1)
            if ( !seen[char]++ ) {
                regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
                string = ( char == "&" ? "\\" : "" ) char
                gsub(regexp,string,fld)
            }
        }
        $i = fld
    }
    print
}

$ awk -f tst.awk file
satelites satélîtès satélîts satélîtès stlts 1257
marginaly mār#ĝínálý mājénélê mār#ĝínál-ý mr#ĝnlý 1252
stroled strôlèd strôld strôlèd strld 12512
franticaly frantícàlý frantéclê frantícàl-ý frntclý 12519
basebal bâsèbål bâsbøl bâsèbål bsbl 1257

See https://stackoverflow.com/a/29626460/1745001 for why I special-case \ and ^ when making each char literal while creating the regexp and why I escape & for the replacement string before calling gsub().
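
To make that special-casing concrete, here is a minimal sketch that applies the same regexp and replacement construction to a single character. The shell function name squeeze_char and the sample strings are made up for illustration; the awk expressions are the ones from the script above. The character and string are passed through ARGV rather than -v so that awk does not interpret escape sequences in them.

```shell
# squeeze_char CHAR STRING: collapse runs of CHAR in STRING to a single CHAR.
# (Hypothetical helper name, used only for this illustration.)
squeeze_char() {
    awk 'BEGIN {
        char = ARGV[1]; str = ARGV[2]
        # \ and ^ are escaped inside a group because they cannot safely be
        # made literal in a bracket expression; everything else goes in [...].
        regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
        # A bare & in the replacement means "the matched text", so escape it.
        string = ( char == "&" ? "\\" : "" ) char
        gsub(regexp, string, str)
        print str
    }' "$1" "$2"
}

squeeze_char l   'telly'    # prints tely
squeeze_char '&' 'a&&&b'    # prints a&b
squeeze_char '^' 'x^^y'     # prints x^y
```

Without the escape, the replacement for & would be the whole matched run, so a&&&b would come back unchanged.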

Note that I'm doing the above field by field because you specifically said "I want to process it field by field rather than line by line" - it'd obviously be briefer and more efficient to do it a whole line at a time.
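
A whole-line variant might look roughly like the sketch below. It is not part of the original script: the wrapper name compress_line is hypothetical, and skipping blanks and tabs is an assumption made so that whitespace field separators survive untouched.

```shell
# compress_line is a hypothetical wrapper name used only for this sketch.
compress_line() {
    awk '
    {
        old = new = $0
        lgth = length(old)
        delete seen
        for (j = 1; j < lgth; j++) {
            char = substr(old, j, 1)
            # Assumption: leave blanks and tabs alone so separators are kept.
            if ( (char !~ /[ \t]/) && !seen[char]++ ) {
                regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
                string = ( char == "&" ? "\\" : "" ) char
                gsub(regexp, string, new)
            }
        }
        print new
    }'
}

printf 'telly foo 1125\n' | compress_line    # prints tely fo 125
```

The trade-off is that each distinct character is processed once per line instead of once per field, but you lose the ability to treat individual fields differently.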

If you truly only want to operate on letters (not numbers or punctuation) then change this:

if ( !seen[char]++ ) {

to this:

if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {

For example note how the numbers and dashes aren't compressed in the output below:

$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        fld = $i
        lgth = length($i)
        delete seen
        for (j=1; j<lgth; j++) {
            char = substr($i,j,1)
            if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {
                regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
                string = ( char == "&" ? "\\" : "" ) char
                gsub(regexp,string,fld)
            }
        }
        $i = fld
    }
    print
}

satelites satélîtès satélîts satélîtès stlts 11257
marginaly mār#ĝínálý mājénélê mār#ĝínál--ý mr#ĝnlý 12252
stroled strôlèd strôld strôlèd strld 12512
franticaly frantícàlý frantéclê frantícàl--ý frntclý 12519
basebal bâsèbål bâsbøl bâsèbål bsbl 12557

CodePudding user response：

According to this page, awk does not support backreferences.

To leave the whitespace characters alone (if those are the field separators), you can use sed and match a non-whitespace character followed by a backreference to it:

sed -E 's/([^[:space:]])\1/\1/g' file
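
As a quick check of that command, here is a small sketch run on made-up sample text. The function names squeeze_pairs and squeeze_runs are only for illustration, and it assumes a sed (such as GNU or BSD sed) that accepts backreferences in the pattern with -E. Note that the pattern as written collapses pairs, so a run of three identical characters still leaves two; adding + after \1 is one way to collapse runs of any length.

```shell
# Hypothetical wrapper names, for illustration only.
squeeze_pairs() { sed -E 's/([^[:space:]])\1/\1/g'; }    # each pair -> one char
squeeze_runs()  { sed -E 's/([^[:space:]])\1+/\1/g'; }   # each run  -> one char

printf 'telly ffooo\n' | squeeze_pairs    # prints tely foo
printf 'telly ffooo\n' | squeeze_runs     # prints tely fo
```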

CodePudding user response：

What version of awk are you using? Some versions don't support backreferences.

You could consider using Perl with the -p (read a file and print $_ after each line) and -F (split each line into the @F list) flags to get awk-like behavior:

# cat test.txt
fo foo ffooo
bar baar bbaaarrrr
# perl -pF=' ' -e 'for (@F) {s/(.)\1/$1/g}; $_ = join(" ", @F)' test.txt
fo fo foo
bar bar baarr