How can one dynamically create a new csv from selected columns of another csv file?-CodePudding

I dynamically iterate through a csv file and select columns that fit the criteria I need. My CSV is separated by commas. I save these indexes to an array that looks like

echo "${cols_needed[@]}"

1 3 4 7 8

I then need to write these columns to a new file and I've tried the following cut and awk commands, however, as the array is dynamically created, I cant seem to find the right commands that can select them all at once. I have tried cut, awk and paste commands.

awk -v fields=${cols_needed[@]} 'BEGIN{ n = split(fields,f) }
    { for (i=1; i<=n;   i) printf "%s%s", $f[i], (i<n?OFS:ORS) }' test.csv

This throws an error as it cannot split the fields unless I hard code them (even then, it can only do 2), split on spaces.

fields="1 2’

I have tried to dynamically create -f parameters, but can only do so with one variable in a loop like so

for item in "${cols_needed[@]}";
    do
       cat test.csv | cut -f$item
    done

which outputs one column at a time.

And I have tried to dynamically create it with commas - input as 1,3,4,7...

cat test.csv | cut -f${cols_needed[@]};

which also does not work!

Any help is appreciated! I understand awk does not work like bash and we cannot pass variables around in the same way. I feel like I'm going around in circles a bit! Thanks in advance.

CodePudding user response：

Your first approach is ok, just:

change -v fields=${cols_needed[@]} to -v fields="${cols_needed[*]}", to pass the array as a single shell word
add FS=OFS="," to BEGIN, after splitting (you want to split on spaces, before FS is changed to ,)
ie. BEGIN {n = split(fields, f); FS=OFS=","}

Also, if there are no commas embedded in quoted csv fields, you can use cut:

IFS=,; cut -d, -f "${cols_needed[*]}" test.csv

If there are embedded commas, you can use gawk's FPAT, to only split fields on unquoted commas.

Here's an example using that.

# prepend $ to each number
for i in "${cols_needed[@]}"; do
    fields[j  ]="\$$i"
done

IFS=,

gawk -v FPAT='([^,] )|(\"[^\"] \")' -v OFS=, "{print ${fields[*]}}"

Injecting shell code in to an awk command is generally not great practice, but it's ok here IMO.

CodePudding user response：

Expanding on my comments re: passing the bash array into awk:

Passing the array in as an awk variable:

$ cols_needed=(1 3 4 7 8)
$ typeset -p cols_needed
declare -a cols_needed=([0]="1" [1]="3" [2]="4" [3]="7" [4]="8")

$ awk -v fields="${cols_needed[*]}" 'BEGIN{n=split(fields,f); for (i=1;i<=n;i  ) print i,f[i]}'
1 1
2 3
3 4
4 7
5 8

Passing the array in as a 'file' via process substitution:

$ awk 'FNR==NR{f[  n]=$1;next} END {for (i=1;i<=n;i  ) print i,f[i]}' <(printf "%s\n" "${cols_needed[@]}")
1 1
2 3
3 4
4 7
5 8

As for OP's main question of extracting a specific set of columns from a .csv file ...

Borrowing dawg's .csv file:

$ cat file.csv
1,2,3,4,5,6,7,8
11,12,13,14,15,16,17,18
21,22,23,24,25,26,27,28

Expanding on the suggestion for passing the bash array in as an awk variable:

awk -v fields="${cols_needed[*]}" '
BEGIN { FS=OFS=","
        n=split(fields,f," ")
      }
      { pfx=""
        for (i=1;i<=n;i  ) {
            printf "%s%s", pfx, $(f[i])
            pfx=OFS
        }
        print ""
      }
' file.csv

NOTE: this assumes OP has provided a valid list of column numbers; if there's some doubt as to the validity of the input (column) numbers then OP can add some logic to address said doubts (eg, are they integers? are they positive integers? do they reference a field (in file.csv) that actually exists?, etc)

This generates:

1,3,4,7,8
11,13,14,17,18
21,23,24,27,28

CodePudding user response：

Suppose you have this variable in bash:

$ echo "${cols_needed[@]}"
3 4 7 8

And this CSV file:

$ cat file.csv
1,2,3,4,5,6,7,8
11,12,13,14,15,16,17,18
21,22,23,24,25,26,27,28

You can select columns of that csv file in awk this way:

awk  '
BEGIN{FS=OFS=","}
FNR==NR{split($0, cols," "); next}
{
    s=""
    for (e=1;e<=length(cols); e  ) 
           s=e<length(cols) ? s $(cols[e]) OFS : s $(cols[e])
    print s
}' <(echo "${cols_needed[@]}") file.csv

Prints:

3,4,7,8
13,14,17,18
23,24,27,28

Or, you can do:

awk -v cw="${cols_needed[*]}" '
BEGIN{FS=OFS=","; split(cw, cols," ")}
{
    s=""
    for (e=1;e<=length(cols); e  ) 
           s=e<length(cols) ? s $(cols[e]) OFS : s $(cols[e])
    print s
}' file.csv 
# same output

BTW, you can do this entirely with cut:

cut -d ',' -f $(IFS=, ; echo "${cols_needed[*]}") file.csv
3,4,7,8
13,14,17,18
23,24,27,28