How can I detect double letters in AWK? (telly -> tely in AWK) backreferences in the match term?

Time:11-23

I'm trying to turn telly into tely.

I've tried

awk 'BEGIN {f="telly" ;print gensub(/(.)\\1/,"\\1","g",f)}'

and

awk 'BEGIN {f="telly" ;print gensub(/(.)\1/,"\\1","g",f)}'

but I still get telly.

I'm pretty sure I can do this (backreferences in the match expression) in sed, and probably in perl too. But I'm writing functions in awk because it makes processing multi-column data simpler than hacking out the columns in sed.

For example, I am running several different processes on a lexicon I'm working with.

Here is an example of some failed output. The third column for connoisseur should not have a double s or n.

otto ottô ottô o-tt--ô ottô 11025
hindu hindü hindö hind--ü hndü 11250
wearily weárílý weérélê weáríl--ý wrlý 11251
nora nørá nøré nør--á nrá 11252
formulate før#mûlâtè fømûlât før#mûlât--è fr#mltè 11253
embryo embrýô embrêô e-mbr--ýô embrýô 11254
stylish stŷliŝħ stîliŝ stŷliŝ--ħ stlŝħ 11255
eruption ėrupţìòn irupŝn ė-rupţìòn ėrpţn 11256
authoritarian auπħorítã#rïán auπoréte#rêén au-πħorítã#rïán auπrt#rn 11258
untouched untóùĉħèð untéĉt u-ntóùĉħèð untĉð 11425
penry penrý penrê penr--ý pnrý 11625
maze mâzè mâz mâz--è mzè 11725
forge før#ĝè føj før#ĝ--è fr#ĝè 11825
ferrari fèŕrārï fŕrārê fèŕrār--ï frrï 12511
assailant ássâìlánt éssâlént á-ssâìlánt ásslnt 25011
corrosive còŕr0ôsivè cŕôsiv còŕr0ôsiv--è cr0svè 25111
daimler dâìmlèŕ dâmlŕ dâìml--èŕ dmlèŕ 25311
connoisseur connoíssèùŕ connoéssŕ connoíss--èùŕ cnnssèùŕ 25511
airframe ãìŕfrâmè eŕfrâm ãìŕ-frâm--è ãìŕfrmè 25911
ampersand ampèŕsand ampŕsand a-mpèŕsand ampsnd 62511

The input is three or four columns per line and I want to process it field by field rather than line by line, hence the use of awk.

Just for info, here is a tiny snippet of the input:

,"accepted","acçeptėd","1118"
,"ellis","ellis","7111"
,"woollen","wōòllén","11111"
,"hurricane","hurrícânè","11113"
,"fuelled","fûéllèd","11114"
,"groom","gröòm","11132"
,"preferring","prėfèŕriñg0","11134"
,"uttered","uttèŕèd","11138"
,"surrendered","sùŕr0endèŕèd","11141"
,"differentiate","différenţïâtè","11145"
,"exceeding","ėxc0êèdiñg0","11146"
,"groove","gröòvè","11148"
,"floppy","floppý","11163"
,"butterflies","buttèŕflîèś","11165"
,"ee","êè","11167"
,"cartoon","cār#töòn","11170"
,"slapped","slappèð","11172"
,"scattering","scattériñg0","11178"
,"jubilee","jübílêè","11179"
,"buzzing","buzziñg0","16111"
,"whipping","wħippiñg0","19111"
,"missus","missμś","21111"
,"corrosive","còŕr0ôsivè","25111"
,"alluring","állūriñg0","31110"
,"confidentially","confídenţìállý","34111"
,"antenna","antenná","35111"
,"whoosh","wħöòŝħ","41114"
,"fattened","fatténèd","49111"
,"cobble","cobblè","61116"

Here are the final lines of the awk file I'm using. It applies functions directly to the fields, which is why I am using awk. The third column in the output comes from a disambiguate function; I had put a gensub in that function to try to 'singl-ify' the double letters.

some code with functions in . . .

BEGIN {FS= "\"" }
{print $2,$4,disambiguate($4),isolate_terminal_vowels($4),devowelCentre(isolate_terminal_vowels($4)),$6}

thx

CodePudding user response:

Awk does not support backreferences in a regexp. Here's how you'd do what you want in awk (updated based on new input and additional information provided):

$ cat tst.awk
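function compress(oldStr,       newStr,lgth,charPos,char,seen,regexp,string) {
    newStr = oldStr
    lgth = length(oldStr)
    # the last char cannot start a repeat, so stop one short of the end
    for (charPos=1; charPos<lgth; charPos++) {
        char = substr(oldStr,charPos,1)
        # for letters only: if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {
        if ( !seen[char]++ ) {
            # build a regexp matching one or more of this char, taken literally
            regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
            # escape & so gsub() does not treat it as "the matched text"
            string = ( char == "&" ? "\\" : "" ) char
            gsub(regexp,string,newStr)
        }
    }
    return newStr
}

BEGIN { FS=OFS="\"" }
{
    # with FS set to a double quote, the quoted values are the even-numbered fields
    for (i=2; i<NF; i+=2) {
        $i = compress($i)
    }
    print
}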

$ awk -f tst.awk file
,"acepted","acçeptėd","18"
,"elis","elis","71"
,"wolen","wōòlén","1"
,"huricane","hurícânè","13"
,"fueled","fûélèd","14"
,"grom","gröòm","132"
,"prefering","prėfèŕriñg0","134"
,"utered","utèŕèd","138"
,"surendered","sùŕr0endèŕèd","141"
,"diferentiate","diférenţïâtè","145"
,"exceding","ėxc0êèdiñg0","146"
,"grove","gröòvè","148"
,"flopy","flopý","163"
,"buterflies","butèŕflîèś","165"
,"e","êè","167"
,"carton","cār#töòn","170"
,"slaped","slapèð","172"
,"scatering","scatériñg0","178"
,"jubile","jübílêè","179"
,"buzing","buziñg0","161"
,"whiping","wħipiñg0","191"
,"misus","misμś","21"
,"corosive","còŕr0ôsivè","251"
,"aluring","álūriñg0","310"
,"confidentialy","confídenţìálý","341"
,"antena","antená","351"
,"whosh","wħöòŝħ","414"
,"fatened","faténèd","491"
,"coble","coblè","616"

Original answer:

$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        fld = $i
        lgth = length($i)
        delete seen
        for (j=1; j<lgth; j++) {
            char = substr($i,j,1)
            if ( !seen[char]++ ) {
                regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
                string = ( char == "&" ? "\\" : "" ) char
                gsub(regexp,string,fld)
            }
        }
        $i = fld
    }
    print
}

$ awk -f tst.awk file
satelites satélîtès satélîts satélîtès stlts 1257
marginaly mār#ĝínálý mājénélê mār#ĝínál-ý mr#ĝnlý 1252
stroled strôlèd strôld strôlèd strld 12512
franticaly frantícàlý frantéclê frantícàl-ý frntclý 12519
basebal bâsèbål bâsbøl bâsèbål bsbl 1257

See https://stackoverflow.com/a/29626460/1745001 for why I special-case \ and ^ when making each char literal while creating the regexp and why I escape & for the replacement string before calling gsub().
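As a small illustration of those two special cases, here is a sketch of my own (it is not taken from the linked answer, and the sample strings `a&&b` and `a^^b` are made up for the demo). It shows what gsub() does with an unescaped `&` in the replacement, and one way to match a literal `^` without putting it first in a bracket expression:

```shell
demo=$(awk 'BEGIN {
    s = "a&&b"

    # An unescaped "&" in the replacement stands for the matched text,
    # so the doubled ampersand is put straight back.
    t = s; gsub("[&]+", "&", t); print t

    # "\\&" in awk source is the two characters \& which gsub()
    # treats as one literal ampersand.
    u = s; gsub("[&]+", "\\&", u); print u

    # "^" placed first inside a bracket expression would negate it, so
    # it is matched as an escaped literal inside a group instead.
    v = "a^^b"; gsub("(\\^)+", "^", v); print v
}')
printf '%s\n' "$demo"
```

The first line printed is unchanged (`a&&b`), while the second and third are compressed (`a&b` and `a^b`).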

Note that I'm doing the above field by field because you specifically said "I want to process it field by field rather than line by line" - it'd obviously be briefer and more efficient to do it a whole line at a time.
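For comparison, a whole-line version could look roughly like the sketch below. This is an assumption about what "a whole line at a time" might mean, not the code from the answer: it swaps the per-character gsub() loop for a single pass that compares each character with the one before it, and it only drops repeated letters. The two sample input lines are made up. It also assumes an awk that handles your characters as whole characters (for accented text, for example gawk in a UTF-8 locale) and that supports `[[:alpha:]]`.

```shell
squeezed=$(printf '%s\n' 'connoisseur 25511' 'telly 11--22' | awk '
{
    out = ""
    prev = ""
    n = length($0)
    for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        # Drop c only when it is a letter that repeats the previous character.
        if ( !((c == prev) && (c ~ /[[:alpha:]]/)) ) {
            out = out c
        }
        prev = c
    }
    print out
}')
printf '%s\n' "$squeezed"
```

On these two lines it prints `conoiseur 25511` and `tely 11--22`: the digits and dashes are left alone because they are not letters.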

If you truly only want to operate on letters (not numbers or punctuation) then change this:

if ( !seen[char]++ ) {

to this:

if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {

For example note how the numbers and dashes aren't compressed in the output below:

$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        fld = $i
        lgth = length($i)
        delete seen
        for (j=1; j<lgth; j++) {
            char = substr($i,j,1)
            if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {
                regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
                string = ( char == "&" ? "\\" : "" ) char
                gsub(regexp,string,fld)
            }
        }
        $i = fld
    }
    print
}

satelites satélîtès satélîts satélîtès stlts 11257
marginaly mār#ĝínálý mājénélê mār#ĝínál--ý mr#ĝnlý 12252
stroled strôlèd strôld strôlèd strld 12512
franticaly frantícàlý frantéclê frantícàl--ý frntclý 12519
basebal bâsèbål bâsbøl bâsèbål bsbl 12557

CodePudding user response:

According to this page, awk does not support backreferences.

To leave the whitespace chars alone (if those are the field separators), you can use sed and match a non-whitespace char followed by a backreference:

sed -E 's/([^[:space:]])\1/\1/g' file
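Two variations on that command, as a hedged sketch: the sample line is made up, and backreferences under `-E` work in GNU sed but are not something I would assume of every sed. The first restricts the match to letters so digits are left alone. The second adds `+` after the backreference so that a run of three or more collapses to one character, where the pair-wise form leaves two behind.

```shell
line='connoisseur 11025 ffooo'

# Collapse each pair of identical letters to one (a triple leaves two behind).
pairs=$(printf '%s\n' "$line" | sed -E 's/([[:alpha:]])\1/\1/g')

# Collapse a whole run of identical letters to a single letter.
runs=$(printf '%s\n' "$line" | sed -E 's/([[:alpha:]])\1+/\1/g')

printf '%s\n%s\n' "$pairs" "$runs"
```

With GNU sed this prints `conoiseur 11025 foo` and then `conoiseur 11025 fo`.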

CodePudding user response:

What version of awk are you using? Some versions don't support backreferences.

You could consider using Perl with the -p (read a file and print $_ after each line) and -F (split each line into the @F list) flags to get awk-like behavior:

# cat test.txt
fo foo ffooo
bar baar bbaaarrrr
# perl -pF=' ' -e 'for (@F) {s/(.)\1/$1/g}; $_ = join(" ", @F)' test.txt
fo fo foo
bar bar baarr