Validate that csv has comma separated values in Bash


Hi, I have the following CSV file:

  • file1.csv:
1,2,3
asjkhfajsdkfasdj,2,iasuhdsg;ghasdgjklh
<>A<SF@!*AFSL<,ahsodgasdklgjaskd2152351235,12
hello,world,!

I want to validate that the above CSV file has the following structure: exactly 3 values (each value may contain any characters) separated by 2 commas:

value1,value2,value3

and not any of the following patterns:

value1
value1,
value1,value2
value1,value2,
value1,value2,value3,
value1,value2,value3,value4

I tried using this command:

CSVcnt=$(awk 'BEGIN{FS=","} END{print NF}' "$csv_file")
echo "CSVcnt = $CSVcnt"

The above command returns 3, which is correct.


But now consider the following scenarios:

  • file2.csv
1,2,3,4,5,6,7,8,9,10
1,2,3
1,2,3
1,2,3

This will return 3, which is incorrect.


  • file3.csv
1,2,3
1,2,3
1,2,3
1,2,3,4,5,6,7,8,9,10

This will return 10, which is correct.


  • file4.csv
1,2,3
1,2,3
1,2,3,4,5,6,7,8,9,10
1,2,3,4

This will return 4, which is incorrect.

So it looks like the awk command I'm using only looks at the last line. Can you please suggest a way to use awk to return the maximum number of fields found anywhere in the file? Please note, I prefer not to use regex and would like to avoid a for loop if possible (each CSV will have from 10K to 40K lines).
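
For reference, this behavior is easy to confirm: NF inside an END block refers to whatever the last record was, so only the final line's field count is reported.

printf '1,2,3,4,5\n1,2,3\n' | awk 'BEGIN{FS=","} END{print NF}'    # prints 3, not 5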

CodePudding user response:

If you just want to get the maximum value of NF then you can do this:

awk -F ',' 'NF > maxNF {maxNF = NF} END {print maxNF+0}' file.csv
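
For example, run against file2.csv from the question, this should print the largest field count seen on any line (the +0 simply forces a numeric 0 to be printed if the file is empty):

awk -F ',' 'NF > maxNF {maxNF = NF} END {print maxNF+0}' file2.csv    # prints 10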

CodePudding user response:

If you want to count how many lines have each number of fields

awk -F, '{a[NF]++} END {for (e in a) {print e ":" a[e]}}' file.csv

produces something like

3:2
4:1
10:1

for your example

1,2,3
1,2,3
1,2,3,4,5,6,7,8,9,10
1,2,3,4
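
If the goal is strictly to validate that every line has exactly 3 fields, the same idea can be turned into a pass/fail check; this is only a sketch, and the expected count of 3 and the exit-status convention are assumptions taken from the question, not part of the answer above:

awk -F, 'NF != 3 { bad++ } END { if (bad) { print bad " malformed line(s)"; exit 1 } }' file.csv

The command exits with status 1 (and prints a count) if any line deviates, so it can be used directly in an if statement or with && / ||.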

CodePudding user response:

If you don't need to worry about quoted fields (which can contain embedded newlines as well as literal commas that are not field separators), maybe try a two-pass solution.

awk -F, 'BEGIN { OFS=":" }
  NR==FNR { n[NF]++; next }
  FNR == 1 { for(k in n) if (n[k] > m) { m=n[k]; f=k } }
  NF != f { print FILENAME, FNR, $0 }' file.csv file.csv

to print all the lines whose field count differs from the "normal" one, where "normal" is simply the field count shared by the largest number of lines in the file.
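
Assuming the command is pointed at file4.csv from the question (where three fields is the most common count), the expected output would be something like:

file4.csv:3:1,2,3,4,5,6,7,8,9,10
file4.csv:4:1,2,3,4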

If you also want to print lines where the last field is empty, you can add

  (!$NF) ||

before the last condition.
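
In other words, the combined command would look something like this:

awk -F, 'BEGIN { OFS=":" }
  NR==FNR { n[NF]++; next }
  FNR == 1 { for(k in n) if (n[k] > m) { m=n[k]; f=k } }
  (!$NF) || NF != f { print FILENAME, FNR, $0 }' file.csv file.csv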

A proper solution is hard to do in Awk alone, but pretty straightforward with Python; the csv module in the standard library handles several dialects of CSV and deals correctly with quoted fields, etc.
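
As a rough sketch of that approach (the expected field count of 3 and the file name are assumptions borrowed from the question, not part of this answer):

python3 - file1.csv <<'EOF'
import csv, sys
# csv.reader handles quoted fields and embedded commas/newlines correctly
with open(sys.argv[1], newline='') as fh:
    for lineno, row in enumerate(csv.reader(fh), start=1):
        if len(row) != 3:
            print(f"line {lineno}: expected 3 fields, got {len(row)}")
EOF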
