Is there a better way to escape pipes (|) in a pipe-separated file?
I have a pipe-separated file. I tried the awk command below, but it works only for records that don't contain escaped double quotes, because it uses the double quote as the field separator.
Input file:
"first | last | name" |" john | white "| age | 52
school |" ABC | USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt | ltd "| joining | 2019
Code:
awk '
BEGIN { FS=OFS="\"" }
{ for (i=2;i<=NF;i+=2)
gsub(/\|/,"\\|",$i)
print
}
' testfile.txt
Output I am getting:
"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" \| 1989\"
company |" Pvt \| ltd "| joining | 2019
Expected output:
"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt \| ltd "| joining | 2019
In the 3rd row, it escapes the pipe after year, which is wrong because that double quote is escaped and is part of the 3rd column. Is there a way to escape a pipe only when it lies inside a quoted column?
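One way to see why the third row goes wrong is to print how awk splits it when FS is a double quote. This is only a diagnostic sketch; the show_fields name is made up for illustration.

```shell
# Hypothetical helper: print each field awk sees when FS is a double quote.
show_fields() {
  awk 'BEGIN { FS="\"" } { for (i=1; i<=NF; i++) printf "%d: [%s]\n", i, $i }'
}

printf '%s\n' 'Home | Road | year\" | 1989\"' | show_fields
```

Field 2 here is the text between the two escaped quotes, so the loop over even-numbered fields treats it as quoted and escapes its pipe.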
CodePudding user response:
Looking at your sample input and output, the following awk solution may work for you:
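awk '{
   p = $0
   # each match is one quoted field, plus the preceding character unless it starts the line
   while (match(p, /(^|[^\\])"[^"\\]*(\\.[^"\\]*)*"/)) {
      m = substr(p,RSTART+1,RLENGTH-1)
      gsub(/\|/, "\\|", m)           # escape pipes inside the quoted field only
      buf = buf substr(p,1,RSTART) m
      p = substr(p,RSTART+RLENGTH)   # continue scanning after the match
   }
   $0 = buf p
   buf = ""
} 1' file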
"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt \| ltd "| joining | 2019
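To check which spans the regex above treats as quoted fields, a small diagnostic can print each match on its own line. This is a sketch that reuses the same regex; the quoted_spans name is hypothetical.

```shell
# Hypothetical helper: list every quoted span the regex finds, one per line.
quoted_spans() {
  awk '{
    p = $0
    while (match(p, /(^|[^\\])"[^"\\]*(\\.[^"\\]*)*"/)) {
      s = substr(p, RSTART, RLENGTH)
      # When the match did not start at the beginning of the text,
      # its first character is the one before the opening quote: drop it.
      if (substr(s, 1, 1) != "\"") s = substr(s, 2)
      print s
      p = substr(p, RSTART + RLENGTH)
    }
  }'
}

printf '%s\n' '"first | last | name" |" john | white "| age | 52' | quoted_spans
printf '%s\n' 'Home | Road | year\" | 1989\"' | quoted_spans
```

The first command lists the two quoted fields of row 1. The second prints nothing, because both quotes in row 3 are preceded by a backslash, which is why that row is left unchanged.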
Alternative one-liner solution using perl:
perl -pe 's/(?<!\\)"[^"\\]*(?:\\.[^"\\]*)*"/$&=~s~\|~\\|~gr/ge' file
"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt \| ltd "| joining | 2019
CodePudding user response:
One common approach would see \" replaced with a nonsensical string, the normal operation performed, then the nonsensical string replaced with \". The nonsensical string will need to be something that does not show up in the original input and it cannot include characters that are treated specially by awk (eg, & when used in string replacement functions).
One awk idea:
awk '
BEGIN { FS=OFS="\"" }
{ gsub(/\\"/,"@@%%",$0) # replace \" with @@%% then continue with original code
for (i=2;i<=NF;i+=2)
gsub(/\|/,"\\|",$i)
gsub(/@@%%/,"\\\"") # replace @@%% with \"
print
}
' testfile.txt
This generates:
"first \| last \| name" |" john \| white "| age | 52
school |" ABC \| USA "| year | 2016
Home | Road | year\" | 1989\"
company |" Pvt \| ltd "| joining | 2019
NOTES: as mentioned in comments ...
- life would be much easier if the input used a common/standard format
- potential solutions get messier as the input formats get messier (eg, this answer will have issues properly processing something like "is\\")
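The "is\\" case from the last note can be reproduced with a small sketch. It wraps the placeholder script from this answer in a shell function; the escape_with_placeholder name and the sample line are made up for illustration.

```shell
# Hypothetical wrapper around the placeholder-based awk script.
escape_with_placeholder() {
  awk '
  BEGIN { FS=OFS="\"" }
  { gsub(/\\"/,"@@%%",$0)      # replace \" with @@%%
    for (i=2;i<=NF;i+=2)
      gsub(/\|/,"\\|",$i)      # escape pipes in even-numbered fields
    gsub(/@@%%/,"\\\"")        # replace @@%% with \"
    print
  }'
}

# Assumed sample: \\ is an escaped backslash, so the first quoted field is "is\\"
printf '%s\n' '"is\\" | "a | b"' | escape_with_placeholder
```

The placeholder step also swallows the closing quote of the first field, because that quote happens to follow a backslash. The script then escapes the pipe between the two quoted fields and leaves the one inside "a | b" alone, which is the opposite of what is wanted.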