Escaping a meta character using awk

Is there a better way to escape pipes (|) in a pipe-separated file?

I have a pipe-separated file. The awk command below works fine only for records that don't contain escaped double quotes, because it uses the double quote as the field separator.

Input file:

"first | last | name" |" john | white "| age | 52
school |" ABC | USA "| year | 2016
Home | Road | year\" | 1989\" 
company |" Pvt | ltd "| joining | 2019

Code:

awk '
BEGIN { FS=OFS="\"" }
  { for (i=2;i<=NF;i+=2)
        gsub(/\|/,"\\|",$i)
    print
  }
' testfile.txt

Output I am getting:

"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" \| 1989\" 
company |" Pvt \| ltd "| joining | 2019

Expected output:

"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\" 
company |" Pvt \| ltd "| joining | 2019

In the 3rd row it escapes the pipe after year, which is wrong because that double quote is escaped and is part of the 3rd column. Is there a way to escape a pipe only when it lies inside a single quoted column?
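
The cause can be seen by printing the fields awk produces for the third row when FS is a double quote. The following is only a diagnostic sketch; the function name show_fields is made up for illustration, and the sample line is the third row without its trailing space.

```shell
# Print each field awk sees when the separator is a double quote.
show_fields() {
  awk 'BEGIN { FS = "\"" } { for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }'
}

printf '%s\n' 'Home | Road | year\" | 1989\"' | show_fields
# 1:[Home | Road | year\]
# 2:[ | 1989\]
# 3:[]
```

Field 2 is an even-numbered field, so the loop treats it as quoted text and escapes its pipe, even though both quotes on that line are escaped.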

CodePudding user response：

Looking at your example input and output, the following awk solution may work for you:

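awk '{
   p = $0
   # find the next quoted string whose opening quote is not preceded by a backslash
   while (match(p, /(^|[^\\])"[^"\\]*(\\.[^"\\]*)*"/)) {
      m = substr(p,RSTART+1,RLENGTH-1)
      gsub(/\|/, "\\|", m)               # escape pipes inside the quoted part only
      buf = buf substr(p,1,RSTART) m
      p = substr(p,RSTART+RLENGTH)       # continue with the rest of the line
   }
   $0 = buf p
   buf = ""
} 1' file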

"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt \| ltd "| joining | 2019

Alternative one-liner solution using perl:

perl -pe 's/(?<!\\)"[^"\\]*(?:\\.[^"\\]*)*"/$&=~s~\|~\\|~gr/ge' file

"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt \| ltd "| joining | 2019


CodePudding user response：

One common approach would see \" replaced with a nonsensical string, the normal operation performed, then the nonsensical string replaced with \". The nonsensical string will need to be something that does not show up in the original input and it cannot include characters that are treated specially by awk (eg, & when used in string replacement functions).

One awk idea:

awk '
BEGIN { FS=OFS="\"" }
      { gsub(/\\"/,"@@%%",$0)             # replace \" with @@%% then continue with original code
        for (i=2;i<=NF;i+=2)
            gsub(/\|/,"\\|",$i)
        gsub(/@@%%/,"\\\"")               # replace @@%% with \"
        print
      }
' testfile.txt

This generates:

"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt \| ltd "| joining | 2019

NOTES: as mentioned in comments ...

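Life would be much easier if the input used a common/standard format.
Potential solutions get messier as the input format gets messier (e.g., this answer will have issues properly processing something like "is\\", where the closing quote follows an escaped backslash and is therefore swallowed by the \" placeholder replacement).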
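
For input such as "is\\", one possible direction is to scan each line character by character instead of splitting on quotes. The sketch below is not part of the answer above; it assumes that a backslash always escapes the character that follows it and that every unescaped double quote toggles the quoted state. The function name escape_quoted_pipes is hypothetical.

```shell
# Escape pipes only inside double-quoted sections, honoring backslash escapes.
escape_quoted_pipes() {
  awk '{
    out = ""; inq = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\\" && i < n) {
        # backslash escape: copy it and the next character unchanged
        out = out c substr($0, i + 1, 1)
        i++
        continue
      }
      if (c == "\"") inq = !inq           # unescaped quote opens or closes a quoted section
      if (c == "|" && inq) out = out "\\|"
      else out = out c
    }
    print out
  }' "$@"
}

printf '%s\n' '"is\\" | x "a|b"' | escape_quoted_pipes
# "is\\" | x "a\|b"
```

Because escaped pairs are copied as a unit, a pipe that is already written as \| inside quotes is left alone rather than escaped a second time. On the sample file from the question this should give the expected output, since the two quotes in the third row are both preceded by a backslash and never toggle the quoted state.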