How to outer-join two CSV files, using shell script?

Time:02-10

I have two CSV files, like the following:

file1.csv

label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"

file2.csv

label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"

Desired Output.csv

label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"

Currently, with the awk command below, I am able to combine the rows whose label appears in both files (like PQR and XYZ), but I am unable to append the rows whose label is present in only one of the files:

awk -F, 'NR==FNR{a[$1]=substr($0,length($1)+2);next} ($1 in a){print $0","a[$1]}' file1.csv file2.csv

CodePudding user response:

awk -v OFS=, '{
        if(!o1[$1]) { o1[$1]=$NF; o2[$1]="NA" } else { o2[$1]=$NF }
    } 
    END{
        for(v in o1) { print v, o1[v], o2[v] }
    }' file{1,2}

## output
LMN,8,NA
ABC,2,NA
PQR,6,6
EFG,1,NA
XYZ,3,4

I think this will do nicely.
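
In the output above, a label that appears only in file2 (EFG, LMN) has its value in the first data column, because the script does not record which file a value came from. The following is a minimal sketch, not the answerer's method, that keeps one array per file. It assumes the quoted sample data from the question, that no field contains a comma, and that outer.csv is just an illustrative scratch file name.

```shell
# Recreate the sample inputs from the question.
cat > file1.csv <<'EOF'
label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"
EOF
cat > file2.csv <<'EOF'
label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"
EOF

# a[] holds values from file1, b[] values from file2, keys[] every label seen.
# Assumption: fields contain no commas, so plain -F, splitting is safe.
awk -F, -v OFS=, '
  FNR == 1  { if (NR == 1) h1 = $2; else h2 = $2; next }  # remember column names
  NR == FNR { a[$1] = $2; keys[$1]; next }                # rows of file1
            { b[$1] = $2; keys[$1] }                      # rows of file2
  END {
    print "label", h1, h2
    for (k in keys)
      print k, (k in a ? a[k] : "NA"), (k in b ? b[k] : "NA")
  }
' file1.csv file2.csv > outer.csv

# for (k in keys) has no defined order, so sort the body and keep the header first.
{ head -n 1 outer.csv; tail -n +2 outer.csv | LC_ALL=C sort -t, -k1,1; } > Output.csv
cat Output.csv
```

Sorting outside awk keeps the script portable, since array traversal order differs between awk implementations.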

CodePudding user response:

I would like to introduce Miller to you. It is a tool that can do a few things with a few file formats and is available as a stand-alone binary. You just have to download the archive, put the mlr executable somewhere (preferably in your PATH) and you're done with the installation.

mlr --csv \
    join -f file1.csv -j 'label' --ul --ur \
    then \
    unsparsify --fill-with 'NA' \
    then \
    sort -f 'label' \
    file2.csv
label,Part-A,Part-B
ABC,2,NA
XYZ,3,4
PQR,6,6
LMN,NA,8
EFG,NA,1

Meaning of the command parts:

  • mlr --csv
    means that you want to read CSV files and output CSV. As another example, if you want to read CSV files and output JSON, it would be mlr --icsv --ojson
  • join -f file1.csv -j 'label' --ul --ur ...... file2.csv
    means to join file1.csv and file2.csv on the field label and also emit the unpaired records of both files
  • then is Miller's way of chaining operations
  • unsparsify --fill-with 'NA'
    means to create the fields that didn't exist in each file and fill them with NA. It's needed for the records that had a unique label
  • then sort -f 'label'
    means to sort the records on the field label

CodePudding user response:

This solution prints exactly the desired result with any AWK. Note that the sorting algorithm is taken from the mawk manual.

# SO71053039.awk

#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP,           n,i,j,hold ) {
  n = 0
  # copy the keys of A into A_SWAP[1..n]
  for (j in A)
    A_SWAP[++n] = j
  for( i = 2 ; i <= n ; i++ )
  {
    hold = A_SWAP[j = i]
    # the "" concatenations force a string comparison
    while ( A_SWAP[j-1] "" > "" hold )
    { j-- ; A_SWAP[j+1] = A_SWAP[j] }
    A_SWAP[j] = hold
  }
  # sentinel A_SWAP[0] = "" will be created if needed
  return n
}

BEGIN {
  FS = OFS = ","
  out = "Output.csv"

  # read file 1
  while ((getline < ARGV[1]) > 0) {
    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME1["label"]]
      LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["\"Part-A\""]
    }
  }
  close(ARGV[1])

  # read file2
  fnr = 0
  while ((getline < ARGV[2]) > 0) {
    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME2["label"]]
      LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["\"Part-B\""]
    }
  }
  close(ARGV[2])

  # print the header
  print "label" OFS "\"Part-A\"" OFS "\"Part-B\"" > out

  # get the result
  z = isort(LABEL_KEY, LABEL_KEY_SWAP)
  for (i = 1; i <= z; i++) {
    result_string = sprintf("%s", LABEL_KEY_SWAP[i])
    if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
      result_string = sprintf("%s", result_string OFS LABEL_KEY1[LABEL_KEY_SWAP[i]] OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? LABEL_KEY2[LABEL_KEY_SWAP[i]] : "NA"))
    else
      result_string = sprintf("%s", result_string OFS "NA" OFS LABEL_KEY2[LABEL_KEY_SWAP[i]])
    print result_string > out
  }
}

Call:

awk -f SO71053039.awk file1.csv file2.csv
=> result file Output.csv with content:
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"

CodePudding user response:

Since your question was titled with "how to do ... in a shell script?" and not necessarily with awk, I'm going to recommend GoCSV, a command-line tool with several sub-commands for processing CSVs (delimited files).

It doesn't have a single command that can accomplish what you need, but you can compose a number of commands to get the correct result.

The core of this solution is the join command which can perform inner (default), left, right, and outer joins; you want an outer join to keep the non-overlapping elements:

gocsv join -c 'label' -outer file1.csv file2.csv > joined.csv
echo 'Joined'
gocsv view joined.csv
Joined
+-------+--------+-------+--------+
| label | Part-A | label | Part-B |
+-------+--------+-------+--------+
| ABC   | 2      |       |        |
+-------+--------+-------+--------+
| XYZ   | 3      | XYZ   | 4      |
+-------+--------+-------+--------+
| PQR   | 6      | PQR   | 6      |
+-------+--------+-------+--------+
|       |        | LMN   | 8      |
+-------+--------+-------+--------+
|       |        | EFG   | 1      |
+-------+--------+-------+--------+

The data-part is correct, but it'll take some work to get the columns correct, and to get the NA values in there.

Here's a complete pipeline:

gocsv join -c 'label' -outer file1.csv file2.csv \
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
| gocsv select -c 'label','Part-A','Part-B' \
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
> final.csv

echo 'Final'
gocsv view final.csv

which gets us the correct, final, file:

Final pipeline
+-------+--------+--------+
| label | Part-A | Part-B |
+-------+--------+--------+
| ABC   | 2      | NA     |
+-------+--------+--------+
| EFG   | NA     | 1      |
+-------+--------+--------+
| LMN   | NA     | 8      |
+-------+--------+--------+
| PQR   | 6      | 6      |
+-------+--------+--------+
| XYZ   | 3      | 4      |
+-------+--------+--------+

There's a lot going on in that pipeline; the high points are:

Merge the two label fields

| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \

Pare down to just the 3 columns you want

| gocsv select -c 'label','Part-A','Part-B' \

Add the NA values and sort by label

| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \

I've made a step-by-step explanation at this Gist.

CodePudding user response:

You mentioned join in the comment on my other answer, and I'd forgotten about this utility:

#!/bin/sh
rm -f *sorted.csv

# Join two files, normally inner-join only, but
# -  `-a 1 -a 2`:    include "unpaired lines" from file 1 and file 2
# -  `-1 1 -2 1`:    the first column from each is the "join column"
# -  `-o 0,1.2,2.2`: output the "join column" (0) and the second fields from files 1 and 2

join -a 1 -a 2 -1 1 -2 1 -o '0,1.2,2.2' -t, file1.csv file2.csv > joined.csv 

# Add NA values
cat joined.csv | sed 's/,,/,NA,/' | sed 's/,$/,NA/' > unsorted.csv

# Sort, pull out header first
head -n 1 unsorted.csv > sorted.csv

# Then sort remainder
tail -n +2 unsorted.csv | sort -t, -k 1 >> sorted.csv

And, here's sorted.csv

+-------+--------+--------+
| label | Part-A | Part-B |
+-------+--------+--------+
| ABC   | 2      | NA     |
+-------+--------+--------+
| EFG   | NA     | 1      |
+-------+--------+--------+
| LMN   | NA     | 8      |
+-------+--------+--------+
| PQR   | 6      | 6      |
+-------+--------+--------+
| XYZ   | 3      | 4      |
+-------+--------+--------+
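
One caveat: join expects both inputs to be sorted on the join field, and the sample file2.csv from the question is not. The sketch below is one possible variant, assuming the quoted sample data from the question. It sorts each file's body first, re-attaches the header afterwards, and lets join's -e option fill in NA instead of sed. The *.body names are hypothetical scratch files.

```shell
# Recreate the sample inputs from the question.
cat > file1.csv <<'EOF'
label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"
EOF
cat > file2.csv <<'EOF'
label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"
EOF

# Use one collation for both sort and join so they agree on the ordering.
export LC_ALL=C

# Sort the data rows (everything after the header) on the first field.
tail -n +2 file1.csv | sort -t, -k1,1 > file1.body
tail -n +2 file2.csv | sort -t, -k1,1 > file2.body

{
  # Header: the first file's header plus the second column name of file2.
  printf '%s,%s\n' "$(head -n 1 file1.csv)" "$(head -n 1 file2.csv | cut -d, -f2)"
  # -a 1 -a 2 keeps unpaired rows; -e NA fills the fields they lack (needs -o).
  join -t, -a 1 -a 2 -e NA -o '0,1.2,2.2' file1.body file2.body
} > sorted.csv
cat sorted.csv
```

Because the inputs are already sorted on the join field, the joined rows come out in label order and no final sort step is needed.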

CodePudding user response:

I suggest a gawk script (gawk is the awk shipped with many Linux distributions):

script.awk

NR == FNR {
  valsStr = sprintf("%s,%s", $2, "na");
  rowsArr[$1] = valsStr;
}
NR != FNR && $1 in rowsArr {
  split(rowsArr[$1],valsArr);
  valsStr = sprintf("%s,%s", valsArr[1], $2);
  rowsArr[$1] = valsStr;
  next;
}
NR != FNR {
  valsStr = sprintf("%s,%s", "na", $2);
  rowsArr[$1] = valsStr;
}
END {
  for (rowName in rowsArr) printf("%s,%s\n", rowName, rowsArr[rowName]);
}

output:

awk -F, -f script.awk input.{1,2}.txt

LMN,na,8
ABC,2,na
PQR,6,6
EFG,na,1
XYZ,3,4
label,Part-A,Part-B
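
The END loop visits the array in an unspecified order, which is why the header line lands at the bottom here. A small post-processing sketch, assuming the output above was saved to a hypothetical file unordered.csv and that only the header line starts with label, moves the header to the top and sorts the rest:

```shell
# The unordered output shown above, saved under an illustrative file name.
cat > unordered.csv <<'EOF'
LMN,na,8
ABC,2,na
PQR,6,6
EFG,na,1
XYZ,3,4
label,Part-A,Part-B
EOF

# Emit the header line first, then every other line sorted on the label field.
{
  grep '^label,' unordered.csv
  grep -v '^label,' unordered.csv | LC_ALL=C sort -t, -k1,1
} > ordered.csv
cat ordered.csv
```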