Home > Software design >  How to escape the unescaped quotes in a csv file string using sed
How to escape the unescaped quotes in a csv file string using sed

Time:09-17

I have received a file (size > 1GB) which is CSV, separated by | When I was importing it to bigquery we found out there are some unescaped " in the file. How can I fix the file by using sed command

This|is|test only|to check quotes||123
second|line|"with quotes"|no text|""|345
third line|with|"start quote" and another " "|a word after quotes|" "|677

expected (no change in line 1,2)

This|is|test only|to check quotes||123
second|line|"with quotes"|no text|""| 345
third line|with|"start quote"" and another "" "|a word after quotes|" "|677

My idea is if there is a " that doesn't start or ends with | should be replaced with ""

CodePudding user response:

With Perl and lookarounds:

$ perl -pe 's/(?<!\|)"(?!\|)/""/g' ip.csv
This|is|test only|to check quotes||123
second|line|"with quotes"|no text|""|345
third line|with|"start quote"" and another "" "|a word after quotes|" "|677

If "start quote" and another " "|foo|"baz" can be a valid input and you don't want to change the first/last double quote, use:

perl -pe 's/(?<!^)(?<!\|)"(?!\|)(?!$)/""/g'

Use perl -i for inplace editing.

CodePudding user response:

As a nice perl solution has been provided, here is a sed alternative:

sed -E 's/([^|"])"([^|"])/\1""\2/g' file

The regex [^|"] matches any characters other than | and ".
If a double quote is surrounded by the characters above, replace it with "".

  • Related