awk ignore the field delimiter pipe inside double quotes

Time:07-30

I know this question has already been answered, but with a comma as the separator: "How to make awk ignore the field delimiter inside double quotes?"

My file, however, is separated by pipes. When I use a pipe in the regex, it is treated as a regex operator rather than a literal character, so I do not get the proper output. I do not use awk extensively. My requirement is to add a single backslash before a pipe character whenever it appears inside a value.

As the file is almost 5GB, I thought I would select particular columns and escape the pipes only there.

INPUT:

"first | last | name" |" steve | white | black"| exp | 12
school |" home | school "| year | 2016
company |" private ltd "| joining | 2019

Expected Output:

"first \| last \| name" |" steve \| white \| black"| exp | 12
school |" home \| school "| year | 2016
company |" private ltd "| joining | 2019

I tried to use gawk with gsub, but had no luck. Is there an alternative approach?

Also, if I have to check multiple columns, how can I do that?

CodePudding user response:

Assumptions:

  • can have more than one field with embedded | character (said field will be wrapped in double quotes)
  • there may be more than one embedded | character in a single field
  • double quotes do not show up as embedded characters within other double quotes
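Since the approaches below rely on double quotes being balanced on every line, it may be worth checking that before processing a large file. The following is a minimal sketch, not part of the original answer; check_quotes is a hypothetical helper name. It only catches lines with an odd number of double quotes, so a complete pair of quotes embedded inside another quoted field would go unnoticed.

```shell
# Hypothetical helper: print the line numbers whose double-quote count is odd.
# Splitting on '"' gives NF = (number of quotes) + 1, so an even NF means
# an odd number of quotes on that line. Empty lines (NF == 0) are skipped.
check_quotes() {
    awk -F'"' 'NF > 0 && NF % 2 == 0 { print NR }' "$@"
}

# Reads files given as arguments, or standard input when there are none.
printf '%s\n' 'a |"x | y"| 1' 'b |"x | y| 2' 'c | d | 3' | check_quotes
# prints: 2
```

No output means every line has an even number of double quotes.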

Setup:

$ cat pipe.dat
name |" steve | white "| exp | 12
school |" home | school "| year | 2016
company |" private ltd "| joining | 2019
food |"pipe | one"|"pipe | two and | three"| 2022        # multiple double-quoted fields, multiple pipes between double quotes
cars | camaro | chevy | 2033                             # no double quotes

NOTE: comments added here to highlight new cases

One awk idea:

awk '
BEGIN { FS=OFS="\"" }              # define field delimiters as double quote
      { for (i=2;i<=NF;i+=2)       # double quoted data resides in the even numbered fields
            gsub(/\|/,"\\|",$i)    # escape all pipe characters in field #i
        print
      }
' pipe.dat

This generates:

name |" steve \| white "| exp | 12
school |" home \| school "| year | 2016
company |" private ltd "| joining | 2019
food |"pipe \| one"|"pipe \| two and \| three"| 2022
cars | camaro | chevy | 2033
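To see why the double-quoted data ends up in the even-numbered fields, it can help to print the fields awk produces when the double quote is the delimiter. This is only an illustrative sketch; show_fields is a hypothetical helper name.

```shell
# Hypothetical helper: split each input line on double quotes and
# print every field with its index.
show_fields() {
    awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }' "$@"
}

printf '%s\n' 'food |"pipe | one"|"pipe | two and | three"| 2022' | show_fields
# prints:
# 1:[food |]
# 2:[pipe | one]
# 3:[|]
# 4:[pipe | two and | three]
# 5:[| 2022]
```

Fields 2 and 4 hold the text that was between double quotes, while the odd-numbered fields hold everything outside them, including the real | delimiters. A line with no double quotes is a single field, so the loop starting at i=2 never touches it.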

Assuming no spaces between the | delimiter and double quotes ...

One GNU awk idea (using the FPAT feature):

awk -v FPAT='([^|]*)|("[^"]+")' '
BEGIN { OFS="|" }
      { for (i=1;i<=NF;i++)
            gsub(/\|/,"\\|",$i)
        print
      }
' pipe.dat

This also generates:

name |" steve \| white "| exp | 12
school |" home \| school "| year | 2016
company |" private ltd "| joining | 2019
food |"pipe \| one"|"pipe \| two and \| three"| 2022
cars | camaro | chevy | 2033

CodePudding user response:

Using awk

$ awk 'BEGIN{FS=OFS="\""} {sub(/\|/,"\\|",$2)}1' input_file
name |" steve \| white "| exp | 12
school |" home \| school "| year | 2016
company |" private ltd "| joining | 2019
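Note that sub() replaces only the first match and that $2 is only the first double-quoted field, so this one-liner fits input with at most one embedded pipe in a single quoted column. The sketch below compares it with a gsub() loop over all even-numbered fields, using the first line of the question's INPUT. The function names first_only and all_quoted are hypothetical, chosen for illustration.

```shell
# The one-liner as given: escapes the first pipe of the first quoted field only.
first_only() {
    awk 'BEGIN { FS = OFS = "\"" } { sub(/\|/, "\\|", $2) } 1' "$@"
}

# Variant: escapes every pipe in every quoted (even-numbered) field.
all_quoted() {
    awk 'BEGIN { FS = OFS = "\"" }
         { for (i = 2; i <= NF; i += 2) gsub(/\|/, "\\|", $i) } 1' "$@"
}

line='"first | last | name" |" steve | white | black"| exp | 12'

printf '%s\n' "$line" | first_only
# prints: "first \| last | name" |" steve | white | black"| exp | 12

printf '%s\n' "$line" | all_quoted
# prints: "first \| last \| name" |" steve \| white \| black"| exp | 12
```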

Using sed (if applicable)

$ sed -E 's/("[^|]*)(\|[^"]*")/\1\\\2/' input_file
name |" steve \| white "| exp | 12
school |" home \| school "| year | 2016
company |" private ltd "| joining | 2019