I'm trying to turn telly into tely.
I've tried
awk 'BEGIN {f="telly" ;print gensub(/(.)\\1/,"\\1","g",f)}'
and
awk 'BEGIN {f="telly" ;print gensub(/(.)\1/,"\\1","g",f)}'
but I still get telly.
I'm pretty sure I can do this (backreferences in the match expression) in sed, and probably in Perl too. But I'm writing functions in awk because it makes processing multi-column data simpler than hacking out the columns in sed.
For example, I'm running different processes on a lexicon I'm working with.
Here is some failed output. The third column of connoisseur should not have a double s or n.
otto ottô ottô o-tt--ô ottô 11025
hindu hindü hindö hind--ü hndü 11250
wearily weárílý weérélê weáríl--ý wrlý 11251
nora nørá nøré nør--á nrá 11252
formulate før#mûlâtè fømûlât før#mûlât--è fr#mltè 11253
embryo embrýô embrêô e-mbr--ýô embrýô 11254
stylish stŷliŝħ stîliŝ stŷliŝ--ħ stlŝħ 11255
eruption ėrupţìòn irupŝn ė-rupţìòn ėrpţn 11256
authoritarian auπħorítã#rïán auπoréte#rêén au-πħorítã#rïán auπrt#rn 11258
untouched untóùĉħèð untéĉt u-ntóùĉħèð untĉð 11425
penry penrý penrê penr--ý pnrý 11625
maze mâzè mâz mâz--è mzè 11725
forge før#ĝè føj før#ĝ--è fr#ĝè 11825
ferrari fèŕrārï fŕrārê fèŕrār--ï frrï 12511
assailant ássâìlánt éssâlént á-ssâìlánt ásslnt 25011
corrosive còŕr0ôsivè cŕôsiv còŕr0ôsiv--è cr0svè 25111
daimler dâìmlèŕ dâmlŕ dâìml--èŕ dmlèŕ 25311
connoisseur connoíssèùŕ connoéssŕ connoíss--èùŕ cnnssèùŕ 25511
airframe ãìŕfrâmè eŕfrâm ãìŕ-frâm--è ãìŕfrmè 25911
ampersand ampèŕsand ampŕsand a-mpèŕsand ampsnd 62511
The input is three or four columns per line and I want to process it field by field rather than line by line, hence the use of awk.
Just for info, here is a tiny snippet of the input:
,"accepted","acçeptėd","1118"
,"ellis","ellis","7111"
,"woollen","wōòllén","11111"
,"hurricane","hurrícânè","11113"
,"fuelled","fûéllèd","11114"
,"groom","gröòm","11132"
,"preferring","prėfèŕriñg0","11134"
,"uttered","uttèŕèd","11138"
,"surrendered","sùŕr0endèŕèd","11141"
,"differentiate","différenţïâtè","11145"
,"exceeding","ėxc0êèdiñg0","11146"
,"groove","gröòvè","11148"
,"floppy","floppý","11163"
,"butterflies","buttèŕflîèś","11165"
,"ee","êè","11167"
,"cartoon","cār#töòn","11170"
,"slapped","slappèð","11172"
,"scattering","scattériñg0","11178"
,"jubilee","jübílêè","11179"
,"buzzing","buzziñg0","16111"
,"whipping","wħippiñg0","19111"
,"missus","missμś","21111"
,"corrosive","còŕr0ôsivè","25111"
,"alluring","állūriñg0","31110"
,"confidentially","confídenţìállý","34111"
,"antenna","antenná","35111"
,"whoosh","wħöòŝħ","41114"
,"fattened","fatténèd","49111"
,"cobble","cobblè","61116"
Here are the final lines of the awk file I'm using. It applies functions directly to the fields, which is why I'm using awk. The third column of the output comes from a disambiguate function; I had put a gensub in that function to try to 'singl-ify' the double letters.
some code with functions in . . .
BEGIN {FS= "\"" }
{print $2,$4,disambiguate($4),isolate_terminal_vowels($4),devowelCentre(isolate_terminal_vowels($4)),$6}
thx
CodePudding user response:
Awk does not support backreferences in a regexp. Here's how you'd do what you want in awk (updated based on new input and additional information provided):
$ cat tst.awk
function compress(oldStr,    newStr,lgth,charPos,char,seen,regexp,string) {
    newStr = oldStr
    lgth = length(oldStr)
    for (charPos=1; charPos<lgth; charPos++) {
        char = substr(oldStr,charPos,1)
        # for letters only: if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {
        if ( !seen[char]++ ) {
            # match 1 or more of this char, made literal
            regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
            string = ( char == "&" ? "\\" : "" ) char
            gsub(regexp,string,newStr)
        }
    }
    return newStr
}
BEGIN { FS=OFS="\"" }
{
    # with FS set to a double quote, the quoted values are the even-numbered fields
    for (i=2; i<NF; i+=2) {
        $i = compress($i)
    }
    print
}
$ awk -f tst.awk file
,"acepted","acçeptėd","18"
,"elis","elis","71"
,"wolen","wōòlén","1"
,"huricane","hurícânè","13"
,"fueled","fûélèd","14"
,"grom","gröòm","132"
,"prefering","prėfèŕriñg0","134"
,"utered","utèŕèd","138"
,"surendered","sùŕr0endèŕèd","141"
,"diferentiate","diférenţïâtè","145"
,"exceding","ėxc0êèdiñg0","146"
,"grove","gröòvè","148"
,"flopy","flopý","163"
,"buterflies","butèŕflîèś","165"
,"e","êè","167"
,"carton","cār#töòn","170"
,"slaped","slapèð","172"
,"scatering","scatériñg0","178"
,"jubile","jübílêè","179"
,"buzing","buziñg0","161"
,"whiping","wħipiñg0","191"
,"misus","misμś","21"
,"corosive","còŕr0ôsivè","251"
,"aluring","álūriñg0","310"
,"confidentialy","confídenţìálý","341"
,"antena","antená","351"
,"whosh","wħöòŝħ","414"
,"fatened","faténèd","491"
,"coble","coblè","616"
Original answer:
$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        fld = $i
        lgth = length($i)
        delete seen
        for (j=1; j<lgth; j++) {
            char = substr($i,j,1)
            if ( !seen[char]++ ) {
                regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
                string = ( char == "&" ? "\\" : "" ) char
                gsub(regexp,string,fld)
            }
        }
        $i = fld
    }
    print
}
$ awk -f tst.awk file
satelites satélîtès satélîts satélîtès stlts 1257
marginaly mār#ĝínálý mājénélê mār#ĝínál-ý mr#ĝnlý 1252
stroled strôlèd strôld strôlèd strld 12512
franticaly frantícàlý frantéclê frantícàl-ý frntclý 12519
basebal bâsèbål bâsbøl bâsèbål bsbl 1257
See https://stackoverflow.com/a/29626460/1745001 for why I special-case \ and ^ when making each char literal while creating the regexp, and why I escape & for the replacement string before calling gsub().
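To make those special cases concrete, here is a small sketch using made-up sample strings (they are not from the lexicon). It shows what happens to & in a gsub() replacement with and without the escape, and the (\^) form used for a literal ^:

```shell
# A bare "&" in the replacement stands for the matched text,
# so the run of "&" is put back unchanged.
printf 'a&&b\n' | awk '{ gsub(/[&]+/, "&"); print }'     # prints a&&b

# Escaped as "\\&" it is a literal "&", so the run collapses to one.
printf 'a&&b\n' | awk '{ gsub(/[&]+/, "\\&"); print }'   # prints a&b

# "^" is negation when it comes first inside brackets, so it is
# escaped inside a group instead of being wrapped in [...].
printf 'a^^b\n' | awk '{ gsub(/(\^)+/, "^"); print }'    # prints a^b
```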
Note that the tst.awk scripts above work field by field because you specifically said "I want to process it field by field rather than line by line". It'd obviously be briefer and more efficient to do it a whole line at a time.
If you truly only want to operate on letters (not numbers or punctuation) then change this:
    if ( !seen[char]++ ) {
to this:
    if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {
For example note how the numbers and dashes aren't compressed in the output below:
$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        fld = $i
        lgth = length($i)
        delete seen
        for (j=1; j<lgth; j++) {
            char = substr($i,j,1)
            # only letters are compressed; digits and dashes are left alone
            if ( (char ~ /[[:alpha:]]/) && !seen[char]++ ) {
                regexp = ( char ~ /[\\^]/ ? "(\\" char ")" : "[" char "]" ) "+"
                string = ( char == "&" ? "\\" : "" ) char
                gsub(regexp,string,fld)
            }
        }
        $i = fld
    }
    print
}
satelites satélîtès satélîts satélîtès stlts 11257
marginaly mār#ĝínálý mājénélê mār#ĝínál--ý mr#ĝnlý 12252
stroled strôlèd strôld strôlèd strld 12512
franticaly frantícàlý frantéclê frantícàl--ý frntclý 12519
basebal bâsèbål bâsbøl bâsèbål bsbl 12557
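As a possible alternative, the same run-collapsing can be sketched without building any dynamic regexps, which sidesteps the escaping special cases entirely: walk the string once and keep a character only when it differs from the one before it. The names squeeze_fields and squeeze() are made up for this illustration and are not part of the answer above. It also assumes an awk whose substr() works on whole characters for your data (GNU awk does in a UTF-8 locale; some other awks work on bytes).

```shell
# Hypothetical wrapper: collapse repeated adjacent characters in every field.
squeeze_fields() {
    awk '
    # squeeze() is an illustrative helper, not a built-in.
    function squeeze(str,    out, prev, i, c, n) {
        n = length(str)
        out = ""
        prev = ""
        for (i = 1; i <= n; i++) {
            c = substr(str, i, 1)
            # keep the character unless it repeats the previous one
            if (i == 1 || c != prev) out = out c
            prev = c
        }
        return out
    }
    {
        for (i = 1; i <= NF; i++) $i = squeeze($i)
        print
    }
    ' "$@"
}

printf 'telly connoisseur 11025\n' | squeeze_fields   # prints tely conoiseur 1025
```

Like the gsub() version, this collapses digits and punctuation too; adding a test such as c ~ /[[:alpha:]]/ to the condition would restrict it to letters.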
CodePudding user response:
From what I've read, awk does not support backreferences in a regexp.
To leave the whitespace characters alone (if those are the field separators), you can use sed and match a non-whitespace character followed by a backreference to it:
sed -E 's/([^[:space:]])\1/\1/g' file
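A quick check of that command on made-up input, assuming a sed that accepts backreferences with -E (GNU sed and BSD sed do). Note that the pattern removes one character per doubled pair, so a run of three leaves two; the second form with \1+ is a variant, not from the answer, that collapses a run of any length:

```shell
# Pairs only: "ooo" becomes "oo" because each match consumes two characters.
printf 'telly ffooo\n' | sed -E 's/([^[:space:]])\1/\1/g'    # prints tely foo

# Runs of any length: \1+ absorbs every further repeat.
printf 'telly ffooo\n' | sed -E 's/([^[:space:]])\1+/\1/g'   # prints tely fo
```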
CodePudding user response:
What version of awk are you using? As far as I know, awk regexps don't support backreferences in the pattern itself; GNU awk's gensub() only accepts them in the replacement text.
You could consider using Perl with the -p (read a file and print $_ after each line) and -F (split each line into the @F list) flags to get awk-like behavior:
# cat test.txt
fo foo ffooo
bar baar bbaaarrrr
# perl -pF=' ' -e 'for (@F) {s/(.)\1/$1/g}; $_ = join(" ", @F)' test.txt
fo fo foo
bar bar baarr