Unix command to remove newline within a field

I receive a flat file as a source. In one of its fields, the value is split across several lines, and I need to remove those newlines so that the value becomes a single line of content.

Example: the file looks like this:

PO,MISC,yes,"This
is
an
example"
PO,MISC,yes,"This
is
another
example"

In the example above, each record is read as four separate lines, but we need each record to be read as a single line, as shown below:

PO, MISC, yes, "This is an example"
PO, MISC, yes, "This is another example"

I tried the command below, but it did not work. Is there any way to achieve this? I also need to write the result to another file.

Syntax:

awk -v RS='([^,]+\\,){4}[^,]+\n' '{gsub(/\n/,"",RT); print RT}' sample_attachments.csv > test.csv

CodePudding user response：

With your shown samples only, please try the following awk, written and tested in GNU awk. In short: RS is set to "\n (a double quote followed by a newline) and the field separator is set to a comma. The main block globally replaces the newlines in the last field ($NF) with spaces, then printf prints the current record followed by RT, the text that matched RS.

awk -v RS="\"\n"  'BEGIN{FS=OFS=","} {gsub(/\n/," ",$NF);printf("%s",$0 RT)}' Input_file

CodePudding user response：

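awk '
   { # a record is incomplete while it holds an odd number of double quotes
     while (gsub(/"/, "&") % 2 == 1 && (getline nextline) > 0)
       $0 = $0 " " nextline
     print }' sample_attachements.txt

gsub(/"/, "&") counts the double quotes in the current record without changing its text. Counting fields cannot tell when a record is complete here, because the multi-line field is the last one, so the loop counts quotes instead.
while the count is odd, a quoted field is still open, so read another line and append it to the input, separated by a space. The loop also stops at end of file.
print finally prints the (modified) input line.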

CodePudding user response：

I would use GNU AWK for this task in the following way. Let the content of file.txt be

field1, field2, field3, field4
PO,MISC,yes,"This
is
an
example"

then

awk 'BEGIN{RS="";FPAT=".";OFS=""}{for(i=1;i<=NF;i+=1){cnt+=($i=="\"");if($i=="\n"&&cnt%2){$i=" "}};print}' file.txt

gives output

field1, field2, field3, field4
PO,MISC,yes,"This is an example"

Assumptions: there is never more than one newline in succession, and " characters are never nested.

Explanation: I tell GNU AWK to use paragraph mode, that is, to treat everything between blank lines as one record. I set the field pattern to ., so that every character is a field, and the output field separator to the empty string. Then I iterate over the characters. Whenever I encounter " I increase cnt by 1, which tracks whether I am outside "..." (cnt is even) or inside "..." (cnt is odd). When I encounter a newline character and cnt is odd, I am inside quotes, so I replace it with a space character. After all characters are processed, I print them.

(tested in gawk 4.2.1)
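
The same quote-parity idea can probably also be sketched without FPAT or paragraph mode, so that it runs in any POSIX awk. This is only a sketch under the same assumptions (quotes are never nested); the names file.txt, joined.txt and the variable inq are chosen here for illustration and are not part of the answer above.

```shell
# Hypothetical sample input, matching the example in this answer.
printf '%s\n' 'field1, field2, field3, field4' \
    'PO,MISC,yes,"This' 'is' 'an' 'example"' > file.txt

# Read line by line; inq tracks whether we are currently inside "...".
awk '{
    n = length($0)
    for (i = 1; i <= n; i++)
        if (substr($0, i, 1) == "\"")
            inq = !inq                      # toggle on every double quote
    # Inside quotes the line break belongs to the field: emit a space instead.
    printf "%s%s", $0, (inq ? " " : "\n")
}' file.txt > joined.txt

cat joined.txt
```

Because the state is carried from one line to the next, this variant does not depend on blank lines separating the records.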

CodePudding user response：

You may use this gnu-awk solution:

awk -v RS='"[^"]*"' '{ORS = gensub(/\n/, " ", "g", RT)} 1' file

field1, field2, field3, field4
PO,MISC,yes,"This is an example"
PO,MISC,no,"This is  another example"

Where input file is this:

cat file
field1, field2, field3, field4
PO,MISC,yes,"This
is
an
example"
PO,MISC,no,"This
is
another
example"

For the updated question use this awk:

awk -F, -v OFS=", " -v RS='"[^"]*"|\n' '{
   ORS = gensub(/\n(.)/, " \\1", "g", RT)
   $1 = $1
} 1' file

field1,  field2,  field3,  field4
PO, MISC, yes, "This is an example"
PO, MISC, no, "This is  another example"

CodePudding user response：

Using any awk:

$ awk -v RS='"' -v ORS= '!(NR%2){gsub(/\n/,OFS); $0="\"" $0 "\""} 1' file
PO,MISC,yes,"This is an example"
PO,MISC,yes,"This is another example"

For anything else, see "What's the most robust way to efficiently parse CSV using awk?".
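
To see why the RS='"' one-liner works, it may help to print the records awk produces when the double quote is the record separator. The sketch below is illustrative only: sample.csv and records.txt are hypothetical file names, and newlines inside a record are shown as <NL> so that each record fits on one output line.

```shell
# Hypothetical copy of the sample data from the question.
printf '%s\n' 'PO,MISC,yes,"This' 'is' 'an' 'example"' \
    'PO,MISC,yes,"This' 'is' 'another' 'example"' > sample.csv

# Number each record; make embedded newlines visible as <NL>.
awk -v RS='"' '{ gsub(/\n/, "<NL>"); printf "%d [%s]\n", NR, $0 }' sample.csv > records.txt

cat records.txt
```

The even-numbered records (2 and 4) hold the text that was between a pair of double quotes, which is why the one-liner only touches records where NR%2 is 0 and then puts the quotes back around them.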

CodePudding user response：

Your input file name is assumed to be "file" here, and the output is written to "newfile". The script assumes that every record spans exactly four lines.

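#!/bin/sh -x

# Work on a copy so the input file is left untouched.
cp file stack
: > newfile

# ed script: write the first four lines to f1, then delete them from stack.
cat > ed1 <<EOF
1,4w f1
1,4d
wq
EOF

# Pop four lines at a time off the stack until it is empty.
while [ -s stack ]; do
    # Stop if fewer than four lines remain (ed fails on the address).
    ed -s stack < ed1 || break
    # Join the four lines with spaces and end the record with a newline.
    paste -s -d ' ' f1 >> newfile
done

# Clean up the temporary files.
rm -f ./ed1 ./f1 ./stack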