I have two CSV files data1.csv and data2.csv the content is something like this (with headers) :
DATA1.csv
Client Name;strnu;addr;fav
MAD01;HDGF;11;V PO
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V ZZ
DATA2.csv
USER;BINin;TYPE
XXMAD01XXXHDGFXX;11;N
KJDGD;635;M
CVOJF01XXHHD;635;N
Issues :
- The value of the 1st and 2nd column of DATA1.csv exist randomly in the first column of DATA2.csv.
For example
MAD01;HDGF
exist in the first column of DATA2***MAD01***HDGF**
(*
can be alphanum and/or symbols charachter) andMAD01;HDGF
might not be in the same order in the columnUSER
of DATA2. - The value of strnum in DATA1 is equal to the value of the column BINin in DATA2
- The column fav DATA1 is the same as TYPE in DATA2 because
V T = M
andV PO = N
(some other valuses may exist but we won't need them for example line 3 of DATA1 it should be ignored)
N.B: some data may exist in a file but not the other.
my bash script needs to generate a new CSV file that should contain:
- The column
USER
from DATA2 Client Name
andstrnu
from DATA1BINin
from DATA2 only if it's equal to the corespondent line and value ofstrnu
DATA1TYPE
usingM
N
Format and making sure to respect the condition thatV T = M
andV PO = N
The first thing i tried was usuing grep to search for lines that exist in both files
#!/bin/sh
DATA1="${1}"
DATA2="${2}"
for i in $(cat $DATA1 | awk -F";" '{print $1".*"$2}' | sed 1d) ; do
grep "$i" $DATA2
done
Result :
$ ./script.sh DATA1.csv DATA2.csv
MAD01;HDGF;11;V PO
XXMAD01XXXHDGFXX;11;N
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V PO
Using grep and awk i could find lines that are present in DATA1 and DATA2 files but it doesn't work for all the lines and i guess it's because of the -
and other special characters present in column 2 of DATA1 but they can be ignored.
I don't know how i can generate a new csv that would mix the lines present in both files but the expected generated CSV should look like this
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;M
CodePudding user response:
This can be done in a single awk program. This is join.awk
BEGIN {
FS = OFS = ";"
print "USER", "Client Name", "strnu", "BINin", "TYPE"
}
FNR == 1 {next}
NR == FNR {
strnu[$1] = $2
next
}
{
for (client in strnu) {
strnu_pattern = strnu[client]
gsub(/-/, "", strnu_pattern)
if ($1 ~ client && $1 ~ strnu_pattern) {
print $1, client, strnu[client], $2, $3
break
}
}
}
and then
awk -f join.awk DATA1.csv DATA2.csv
outputs
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N
CodePudding user response:
Assumptions/understandings:
- ignore lines from
DATA1.csv
where thefav
field is not one ofV T
orV PO
- when matching fields we need to ignore the any hyphens from the
DATA1.csv
fields - when matching fields the strings from
DATA1.csv
can show up in either order inDATA2.csv
- last line of the expected output show end with
635,N
One `awk idea:
awk '
BEGIN { FS=OFS=";"
print "USER","Client Name","strnu","BINin","TYPE" # print new header
}
FNR==1 { next } # skip input headers
FNR==NR { if ($4 == "V PO" || $4 == "V T") { # only process if fav is one of "V PO" or "V T"
cnames[FNR]=$1 # save client name
strnus[FNR]=$2 # save strnu
}
next
}
{ for (i in cnames) { # loop through array indices
cname=cnames[i] # make copy of client name ...
strnu=strnus[i] # and strnu so that we can ...
gsub(/-/,"",cname) # strip hypens from both ...
gsub(/-/,"",strnu) # in order to perform the comparisons ...
if (index($1,cname) && index($1,strnu)) { # if cname and strnu both exist in $1 then index()>=1 in both cases so ...
print $1,cnames[i],strnus[i],$2,$3 # print to stdout
next # we found a match so break from loop and go to next line of input
}
}
}
' DATA1.csv DATA2.csv
This generates:
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N