Reading CSV file in Shell Scripting

I am trying to read values from a CSV file dynamically based on the header. Here's what my input files can look like.

File 1:
name,city,age
john,New York,20
jane,London,30

or

File 2: 
name,age,city,country
john,20,New York,USA
jane,30,London,England

I may not be following the best approach, but I tried the following code.

#!/bin/bash

{
read -r line
line=`tr ',' ' ' <<< $line`

while IFS=, read -r `$line`
do
 echo $name
 echo $city
 echo $age
done
} < file.txt

I am expecting the above code to read the header values as the variable names. I know that the order of columns can differ between input files, but I expect every input file to have name, city and age columns. Is this the right approach? If so, what is the fix? The above code fails with the error "line 7: name: command not found".

CodePudding user response:

The issue is caused by the backticks around $line: Bash evaluates the contents as a command and replaces the backticks with the output of the command it just ran, which is where "name: command not found" comes from.
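
To make the failure concrete, here is a minimal sketch of that expansion (using the header from File 1, with echo standing in for read):

line='name city age'
echo `$line`   # the backticks run "name city age" as a command (hence
               # "name: command not found" on stderr) and substitute its
               # empty output into the echo arguments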

You can simply put the variable (without backticks) after the read command to achieve what you want:

#!/bin/bash

{
    read -r line
    line=`tr ',' ' ' <<< $line`

    echo "$line"

    while IFS=, read -r $line ; do
        echo "person: $name -- $city -- $age"
    done
} < file.txt

Some notes on your code:

  1. The backtick syntax is legacy syntax; it is now preferred to use $(...) to evaluate commands. The new syntax is more flexible (see the short example after these notes).

  2. You can enable automatic script failure with set -euo pipefail. This will make your script stop if it encounters an error.

  3. Your code is currently very sensitive to invalid header data: with a file like

n ame,age,city,country
john,20,New York,USA
jane,30,London,England

your script (or rather the version in the beginning of my answer) will run without errors but with invalid output.

It is also good practice to quote variables to prevent unwanted splitting.
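
As a small illustration of point 1 and the quoting advice (just a sketch, reusing file.txt from the question):

header=$(head -n 1 file.txt)      # $(...) instead of legacy backticks
line=$(tr ',' ' ' <<< "$header")  # quoting "$header" prevents word splitting
echo "$line"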

To make it much more robust, you can change it as follows:

#!/bin/bash
set -euo pipefail
# -e and -o pipefail will make the script exit
#    in case of command failure (or piped command failure)
# -u will exit in case a variable is undefined
#    (in your case, if the header is invalid)

{
    read -r line
    readarray -d, -t header < <(printf "%s" "$line")
    # using an array allows to detect if one of the header entries
    # contains an invalid character
    # the printf is needed because bash would add a newline to the
    # command input if using a here-string (<<<).

    while IFS=, read -r "${header[@]}" ; do
        echo "$name"
        echo "$city"
        echo "$age"
    done
} < file.txt
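
As a quick, hedged check of the -u behaviour (column chosen only for illustration): if an input file had no city column, the header array would never define city, and the first echo "$city" would abort the script with an error along the lines of "city: unbound variable", which is exactly the failure mode -u is meant to surface.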

CodePudding user response:

A slightly different approach lets awk handle the field separation and the ordering of the desired output for either of your input files. Below, awk stores the desired output order in the f[] (field) array set in the BEGIN rule. Then, on the first line of each file (FNR==1), the a[] array is cleared and filled with the headings from the current file. At that point you just loop over the field names in order from the f[] array and output the corresponding field from the current line, e.g.

 awk -F, '
  BEGIN { f[1]="name"; f[2]="city"; f[3]="age" }  # desired order
  FNR==1 {                      # on first line read header
    delete a                    # clear a array
    for (i=1; i<=NF; i++)       # loop over headings
      a[$i] = i                 # index by heading, val is field no.
    next                        # skip to next record
  }
  {
    print ""                    # optional newline between outputs
    for (i=1; i<=3; i++)        # loop over desired field order
      if (f[i] in a)            # validate field in a array
        print $a[f[i]]          # output the field's value
  }
' file1 file2

Example Use/Output

In your case with the content you show in file1 and file2, you would have:

$ awk -F, '
>   BEGIN { f[1]="name"; f[2]="city"; f[3]="age" }  # desired order
>   FNR==1 {                      # on first line read header
>     delete a                    # clear a array
>     for (i=1; i<=NF; i++)       # loop over headings
>       a[$i] = i                 # index by heading, val is field no.
>     next                        # skip to next record
>   }
>   {
>     print ""                    # optional newline between outputs
>     for (i=1; i<=3; i++)        # loop over desired field order
>       if (f[i] in a)            # validate field in a array
>         print $a[f[i]]          # output the field's value
>   }
> ' file1 file2

john
New York
20

jane
London
30

john
New York
20

jane
London
30

Both files are read and handled identically despite having different field orderings. Let me know if you have further questions.

CodePudding user response:

If using Bash version ≥ 4.2, it is possible to use an associative array to capture an arbitrary number of fields with their names as keys:

#!/usr/bin/env bash

# Associative array to store column names as keys and their values
declare -A fields

# Array to store column names with index
declare -a column_name

# Array to store row's values
declare -a line

# Commands block consuming CSV input
{
  # Read first line to capture column names
  IFS=, read -r -a column_name

  # Process records
  while IFS=, read -r -a line; do

    # Store column values to corresponding field name
    for ((i=0; i<${#column_name[@]}; i++)); do
      # Fills fields' associative array
      fields["${column_name[i]}"]="${line[i]}"
    done

    # Dump fields for debug|demo purpose
    # Processing of each captured value could go there instead
    declare -p fields
  done
} < file.txt

Sample output with File 2:

declare -A fields=([country]="USA" [city]="New York" [age]="20" [name]="john" )
declare -A fields=([country]="England" [city]="London" [age]="30" [name]="jane" )

For older Bash versions without associative arrays, use an indexed array of column names instead:

#!/usr/bin/env bash

# Array to store column names with index
declare -a column_name

# Array to store values for a line
declare -a value

# Commands block consuming CSV input
{
  # Read first line to capture column names
  IFS=, read -r -a column_name

  # Process records
  while IFS=, read -r -a value; do

    # Print record separator
    printf -- '--------------------------------------------------\n'

    # Print captured field name and value
    for ((i=0; i<"${#column_name[@]}"; i++)); do
      printf '%-18s: %s\n' "${column_name[i]}" "${value[i]}"
    done
  done
} < file.txt

Output:

--------------------------------------------------
name              : john
age               : 20
city              : New York
country           : USA
--------------------------------------------------
name              : jane
age               : 30
city              : London
country           : England