I have received a file (size > 1GB) which is CSV, separated by |
When I was importing it to bigquery we found out there are some unescaped "
in the file.
How can I fix the file by using sed
command
This|is|test only|to check quotes||123
second|line|"with quotes"|no text|""|345
third line|with|"start quote" and another " "|a word after quotes|" "|677
expected (no change in line 1,2)
This|is|test only|to check quotes||123
second|line|"with quotes"|no text|""| 345
third line|with|"start quote"" and another "" "|a word after quotes|" "|677
My idea is if there is a "
that doesn't start or ends with |
should be replaced with ""
CodePudding user response:
With Perl and lookarounds:
$ perl -pe 's/(?<!\|)"(?!\|)/""/g' ip.csv
This|is|test only|to check quotes||123
second|line|"with quotes"|no text|""|345
third line|with|"start quote"" and another "" "|a word after quotes|" "|677
If "start quote" and another " "|foo|"baz"
can be a valid input and you don't want to change the first/last double quote, use:
perl -pe 's/(?<!^)(?<!\|)"(?!\|)(?!$)/""/g'
Use perl -i
for inplace editing.
CodePudding user response:
As a nice perl
solution has been provided, here is a sed
alternative:
sed -E 's/([^|"])"([^|"])/\1""\2/g' file
The regex [^|"]
matches any characters other than |
and "
.
If a double quote is surrounded by the characters above, replace
it with ""
.