Is it possible to cut a line from a csv if there is multiple appearances of a specific character?-CodePudding

I am working on a csv but some lines have too many separators (due to a bug with the export from an old database), making it impossible to import in my postgresql database.

My goal is to loop for each line of the csv and check if it has the exact required number of separators (which is ; in my case). If a line is incorrect, I cut it from the csv and paste it in another file.

I tried with some bash script but I could not get it right because of my lack of skills with regex.

CodePudding user response：

Do you have the ability to fix the bug in the original database export? I assume not, but it is worth asking. There is an old adage in software (as in many other fields): "Cure the disease, not the symptom." If you have the ability to fix the root cause, then it is almost always best to do that instead of working around the problem.

That said, one way of doing what you want is to read the input file line by line, then strip out all non-separator characters and count what is left:

#!/bin/bash

# Set these variables to your required values:
input_file="filename.csv"
output_file="filename.out.csv"
cut_file="filename.cut.csv"
required_count=10

# Make sure your output files start empty:
cat /dev/null > ${output_file}
cat /dev/null > ${cut_file}

while read line; do
   separators=${line//[!;]/}
   if [ ${#separators} -ne ${required_count} ]; then
     echo "${line}" >> ${cut_file}
   else
     echo "${line}" >> ${output_file}
   fi
done < ${input_file}

The magic here is the use of "parameter expansion/substitution" in two places:

${line//[!;]/}

This means:

${line : Use the variable 'line'...
// : find all instances of...
[!;] : any character that is not ';'...
/ : substitute it for...
: the empty string...
} : finish

and

${#separators} is the number of characters in the variable 'separators'.

Parameter expansion and substitution can be hard to get your head around, but is very powerful. It is well worth studying all the variations. I find these pages very helpful:

GNU Bash Manual

Linux Documentation Project

Parameter expansion and substitution is efficient because it happens inside the shell and does not require the creation of a child process to run an external utility (such as 'sed' or 'cut'). It is not so important in this case, but as your scripts get more complicated, you will want to minimize that number of child processes.

CodePudding user response：

Sample data:

$ cat db.data
a;b;c;this is a good line
d;e;f;this is a good line
g;h;i;this;is;a;bad;line
jkl;another bad line
x;y;z;this is a good line

NOTES:

'good' data contains 4x ; delimited fields
'bad' contains less/more than 4x ; delimited fields

One awk idea:

awk -F';' 'NF!=4 {print $0 > "db.bad.data"; next}1' db.data > db.good.data

This generates:

$ head db.*.data
==> db.bad.data <==
g;h;i;this;is;a;bad;line
jkl;another bad line

==> db.good.data <==
a;b;c;this is a good line
d;e;f;this is a good line
x;y;z;this is a good line