I have two CSV files, like the following:
file1.csv
label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"
file2.csv
label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"
Desired Output.csv
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"
Currently with the below awk command i am able to combine the matching one's which have entries for label in both the files like PQR and XYZ but unable to append the ones that are not having label values present in both the files:
awk -F, 'NR==FNR{a[$1]=substr($0,length($1) 2);next} ($1 in a){print $0","a[$1]}' file1.csv file2.csv
CodePudding user response:
awk -v OFS=, '{
if(!o1[$1]) { o1[$1]=$NF; o2[$1]="NA" } else { o2[$1]=$NF }
}
END{
for(v in o1) { print v, o1[v], o2[v] }
}' file{1,2}
## output
LMN,8,NA
ABC,2,NA
PQR,6,6
EFG,1,NA
XYZ,3,4
I think this will do nicely.
CodePudding user response:
I would like to introduce Miller to you. It is a tool that can do a few things with a few file formats and is available as a stand-alone binary. You just have to download the archive, put the mlr
executable somewhere (preferably in your PATH) and you're done with the installation.
mlr --csv \
join -f file1.csv -j 'label' --ul --ur \
then \
unsparsify --fill-with 'NA' \
then \
sort -f 'label' \
file2.csv
label,Part-A,Part-B
ABC,2,NA
XYZ,3,4
PQR,6,6
LMN,NA,8
EFG,NA,1
Meaning of the command parts:
mlr --csv
means that you want to read CSV files and output a CSV format. As an other example, if you want to read CSV files and output a JSON format it would bemlr --icsv --ojson
join -f file1.csv -j 'label' --ul --ur
......file2.csv
means to joinfile1.csv
andfile2.csv
on the fieldlabel
and also emit the unmatching records of both filesthen
is Miller's way of chaining operationsunsparsify --fill-with 'NA'
means to create the fields that didn't exist in each file and fill them withNA
. It's needed for the records that had a uniq labelthen
sort -f 'label'
means to sort the records on the fieldlabel
CodePudding user response:
This solution prints exactly the wished result with any AWK. Please note that the sorting algorithm is taken from the mawk manual.
# SO71053039.awk
#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP, n,i,j,hold ) {
n = 0
for (j in A)
A_SWAP[ n] = j
for( i = 2 ; i <= n ; i )
{
hold = A_SWAP[j = i]
while ( A_SWAP[j-1] "" > "" hold )
{ j-- ; A_SWAP[j 1] = A_SWAP[j] }
A_SWAP[j] = hold
}
# sentinel A_SWAP[0] = "" will be created if needed
return n
}
BEGIN {
FS = OFS = ","
out = "Output.csv"
# read file 1
while ((getline < ARGV[1]) > 0) {
fnr
if (fnr == 1) {
for (i=1; i<=NF; i )
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME1["label"]]
LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["\"Part-A\""]
}
}
close(ARGV[1])
# read file2
fnr = 0
while ((getline < ARGV[2]) > 0) {
fnr
if (fnr == 1) {
for (i=1; i<=NF; i )
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME2["label"]]
LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["\"Part-B\""]
}
}
close(ARGV[2])
# print the header
print "label" OFS "\"Part-A\"" OFS "\"Part-B\"" > out
# get the result
z = isort(LABEL_KEY, LABEL_KEY_SWAP)
for (i = 1; i <= z; i ) {
result_string = sprintf("%s", LABEL_KEY_SWAP[i])
if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
result_string = sprintf("%s", result_string OFS LABEL_KEY1[LABEL_KEY_SWAP[i]] OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? LABEL_KEY2[LABEL_KEY_SWAP[i]] : "NA"))
else
result_string = sprintf("%s", result_string OFS "NA" OFS LABEL_KEY2[LABEL_KEY_SWAP[i]])
print result_string > out
}
}
Call:
awk -f SO71053039.awk file1.csv file2.csv
=> result file Output.csv with content:
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"
CodePudding user response:
Since your question was titled with "how to do ... in a shell script?" and not necessarily with awk, I'm going to recommend GoCSV, a command-line tool with several sub-commands for processing CSVs (delimited files).
It doesn't have a single command that can accomplish what you need, but you can compose a number of commands to get the correct result.
The core of this solution is the join command which can perform inner (default), left, right, and outer joins; you want an outer join to keep the non-overlapping elements:
gocsv join -c 'label' -outer file1.csv file2.csv > joined.csv
echo 'Joined'
gocsv view joined.csv
Joined
------- -------- ------- --------
| label | Part-A | label | Part-B |
------- -------- ------- --------
| ABC | 2 | | |
------- -------- ------- --------
| XYZ | 3 | XYZ | 4 |
------- -------- ------- --------
| PQR | 6 | PQR | 6 |
------- -------- ------- --------
| | | LMN | 8 |
------- -------- ------- --------
| | | EFG | 1 |
------- -------- ------- --------
The data-part is correct, but it'll take some work to get the columns correct, and to get the NA
values in there.
Here's a complete pipeline:
gocsv join -c 'label' -outer file1.csv file2.csv \
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
| gocsv select -c 'label','Part-A','Part-B' \
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
> final.csv
echo 'Final'
gocsv view final.csv
which gets us the correct, final, file:
Final pipeline
------- -------- --------
| label | Part-A | Part-B |
------- -------- --------
| ABC | 2 | NA |
------- -------- --------
| EFG | NA | 1 |
------- -------- --------
| LMN | NA | 8 |
------- -------- --------
| PQR | 6 | 6 |
------- -------- --------
| XYZ | 3 | 4 |
------- -------- --------
There's a lot going on in that pipeline, the high points are:
Merge the the two label fields
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
Pare-down to just the 3 columns you want
| gocsv select -c 'label','Part-A','Part-B' \
Add the NA values and sort by label
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
I've made a step-by-step explanation at this Gist.
CodePudding user response:
You mentioned join in the comment on my other answer, and I'd forgotten about this utility:
#!/bin/sh
rm -f *sorted.csv
# Join two files, normally inner-join only, but
# - `-a 1 -a 2`: include "unpaired lines" from file 1 and file 2
# - `-1 1 -2 1`: the first column from each is the "join column"
# - `-o 0,1.2,2.2`: output the "join column" (0) and the second fields from files 1 and 2
join -a 1 -a 2 -1 1 -2 1 -o '0,1.2,2.2' -t, file1.csv file2.csv > joined.csv
# Add NA values
cat joined.csv | sed 's/,,/,NA,/' | sed 's/,$/,NA/' > unsorted.csv
# Sort, pull out header first
head -n 1 unsorted.csv > sorted.csv
# Then sort remainder
tail -n 2 unsorted.csv | sort -t, -k 1 >> sorted.csv
And, here's sorted.csv
------- -------- --------
| label | Part-A | Part-B |
------- -------- --------
| ABC | 2 | NA |
------- -------- --------
| EFG | NA | 1 |
------- -------- --------
| LMN | NA | 8 |
------- -------- --------
| PQR | 6 | 6 |
------- -------- --------
| XYZ | 3 | 4 |
------- -------- --------
CodePudding user response:
We suggest gawk
script which is standard Linux awk
:
script.awk
NR == FNR {
valsStr = sprintf("%s,%s", $2, "na");
rowsArr[$1] = valsStr;
}
NR != FNR && $1 in rowsArr {
split(rowsArr[$1],valsArr);
valsStr = sprintf("%s,%s", valsArr[1], $2);
rowsArr[$1] = valsStr;
next;
}
NR != FNR {
valsStr = sprintf("%s,%s", "na", $2);
rowsArr[$1] = valsStr;
}
END {
for (rowName in rowsArr) printf("%s,%s\n", rowName, rowsArr[rowName]);
}
output:
awk -F, -f script.awk input.{1,2}.txt
LMN,na,8
ABC,2,na
PQR,6,6
EFG,na,1
XYZ,3,4
label,Part-A,Part-B