I have a CSV file in which what is supposed to be a single line is split across several. I need a way to join the lines that are split. Also, the number of fields (separated by ,) is not fixed.
A correct line has the following pattern:
X,X,X,"()",Y,H where X can be any number of fields. However, the end of the string, "()",Y,H, is fixed. Y and H are both one word.
The issue is that this line can appear as (or any variant of this):
X,X,
X, "()"
,Y,H
What I need is a way (awk, sed) to append every line that has fewer than 24 commas and does not end with ",Y,H to the previous line.
Please bear in mind that it's a large file, although I have 256 GB of RAM.
Example
- Correct lines
a, b, c, "()", h, k
a, b, c, d, "()", h, k
- Same lines in the file
First line
a, b, c,
"()", h, k
Second line
a, b, c, d, "()"
, h
, k
So far I've tried this (not working):
awk '/"[:space:]*,[:space:]*[:alpha:]+[:space:]*,[:space:]*[:alpha:]+$/{print}' check.csv
to try to find the lines ending with ", X, Y where X and Y are words.
Also, since a correct line has at least 24 fields, I've used:
awk 'NF<24{print}' check.csv
to filter out lines with fewer than 24 fields.
My idea is to detect lines that match both regular expressions and append them to the previous line.
Thank you!
CodePudding user response:
This might work for you (GNU sed):
sed '/"()", *[^,]\+, *[^,]\+$/b;:a;N;s/\n//;/"()", *[^,]\+, *[^,]\+$/!ba;P;D' file
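If a line is already correct (it ends with "()" followed by two fields), skip the rest of the script and print it as is.
Otherwise, append the next line, remove the newline that N introduced, and try the match again.
Repeat until the joined line matches; P then prints it and D deletes it, which starts the next cycle on fresh input.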
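If sed is not a hard requirement, the same join-until-it-matches loop can be sketched in Python. This is only an illustration of the idea above, not a drop-in replacement: the function name join_split_records and the constant COMPLETE are made up for this example, and the regex simply mirrors the sed pattern (the "()" marker followed by two comma-free fields at the end of the line).

```python
import re
from typing import Iterable, Iterator

# Assumption: a record is complete once it ends with "()" followed by
# exactly two non-empty, comma-free fields (same test as the sed pattern).
COMPLETE = re.compile(r'"\(\)", *[^,]+, *[^,]+$')


def join_split_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one joined record per complete logical line."""
    buffer = ""
    for line in lines:
        # Drop the line ending and glue the fragment onto what we have so far.
        buffer += line.rstrip("\r\n")
        if COMPLETE.search(buffer):
            yield buffer
            buffer = ""
    if buffer:
        # Assumption: a trailing incomplete record is emitted as-is
        # rather than silently dropped.
        yield buffer
```

Since it only keeps the current record in memory, it can be fed an open file object directly (for example, iterating over open("check.csv")), so the size of the file should not matter much.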
CodePudding user response:
perl -lanF, -e 'push @L, grep length, @F; if ($L[-3] eq q/"()"/) { print join ",", @L; @L=() }' file
- use -l -n -e to loop over input lines without printing, and append line breaks to output
- use -a -F, to create the @F array by splitting input on commas
- push @L, grep length, @F: push nonempty fields onto @L
- if ($L[-3] eq q/"()"/): if the third-to-last accumulated field is the magic marker:
  - print join ",", @L: print all of @L joined with commas
  - @L=(): reset @L
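For comparison, here is a rough Python sketch of the same field-accumulation idea. The names join_by_fields, MARKER and strip_fields are invented for this example. Note that the Perl comparison is an exact string match, so a field like ' "()"' with a leading space (as in the question's example lines) would not count as the marker; the optional strip_fields flag is an assumption added here to handle that case, and unlike the one-liner this sketch also emits any leftover fields at the end of the input.

```python
from typing import Iterable, Iterator, List

MARKER = '"()"'  # the fixed third-to-last field of a correct record


def join_by_fields(lines: Iterable[str], strip_fields: bool = False) -> Iterator[str]:
    """Accumulate non-empty fields until the third-to-last one is the marker."""
    fields: List[str] = []
    for line in lines:
        for field in line.rstrip("\r\n").split(","):
            if strip_fields:
                # Assumption: surrounding whitespace is not significant.
                field = field.strip()
            if field:  # skip empty fields left by leading/trailing commas
                fields.append(field)
        if len(fields) >= 3 and fields[-3] == MARKER:
            yield ",".join(fields)
            fields = []
    if fields:
        # Assumption: leftover fields from an incomplete record are still emitted.
        yield ",".join(fields)
```

With strip_fields=True the output is normalised to comma-only separators, which may or may not be what you want for the real file.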