Bash script to compare and generate csv datafile-CodePudding

I have two CSV files data1.csv and data2.csv the content is something like this (with headers) :

DATA1.csv

Client Name;strnu;addr;fav
MAD01;HDGF;11;V PO
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V ZZ

DATA2.csv

USER;BINin;TYPE
XXMAD01XXXHDGFXX;11;N
KJDGD;635;M
CVOJF01XXHHD;635;N

Issues :

The value of the 1st and 2nd column of DATA1.csv exist randomly in the first column of DATA2.csv. For example MAD01;HDGF exist in the first column of DATA2 ***MAD01***HDGF** (* can be alphanum and/or symbols charachter) and MAD01;HDGF might not be in the same order in the column USER of DATA2.
The value of strnum in DATA1 is equal to the value of the column BINin in DATA2
The column fav DATA1 is the same as TYPE in DATA2 because V T = M and V PO = N (some other valuses may exist but we won't need them for example line 3 of DATA1 it should be ignored)

N.B: some data may exist in a file but not the other.

my bash script needs to generate a new CSV file that should contain:

The column USER from DATA2
Client Name and strnu from DATA1
BINin from DATA2 only if it's equal to the corespondent line and value of strnu DATA1
TYPE using M N Format and making sure to respect the condition that V T = M and V PO = N

The first thing i tried was usuing grep to search for lines that exist in both files

#!/bin/sh

DATA1="${1}"
DATA2="${2}"

for i in $(cat $DATA1 | awk -F";" '{print $1".*"$2}' | sed 1d) ; do
   grep "$i" $DATA2
done

Result :

$ ./script.sh DATA1.csv DATA2.csv
MAD01;HDGF;11;V PO
XXMAD01XXXHDGFXX;11;N
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V PO

Using grep and awk i could find lines that are present in DATA1 and DATA2 files but it doesn't work for all the lines and i guess it's because of the - and other special characters present in column 2 of DATA1 but they can be ignored.

I don't know how i can generate a new csv that would mix the lines present in both files but the expected generated CSV should look like this

USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;M

CodePudding user response：

This can be done in a single awk program. This is join.awk

BEGIN {
    FS = OFS = ";"
    print "USER", "Client Name", "strnu", "BINin", "TYPE"
}

FNR == 1 {next}

NR == FNR {
    strnu[$1] = $2
    next
}

{
    for (client in strnu) {
        strnu_pattern = strnu[client]
        gsub(/-/, "", strnu_pattern)

        if ($1 ~ client && $1 ~ strnu_pattern) {
            print $1, client, strnu[client], $2, $3
            break
        }
    }
}

and then

awk -f join.awk DATA1.csv DATA2.csv

outputs

USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N

CodePudding user response：

Assumptions/understandings:

ignore lines from DATA1.csv where the fav field is not one of V T or V PO
when matching fields we need to ignore the any hyphens from the DATA1.csv fields
when matching fields the strings from DATA1.csv can show up in either order in DATA2.csv
last line of the expected output show end with 635,N

One `awk idea:

awk '
BEGIN   { FS=OFS=";"
          print "USER","Client Name","strnu","BINin","TYPE"   # print new header
        }
FNR==1  { next }                                              # skip input headers
FNR==NR { if ($4 == "V PO" || $4 == "V T") {                  # only process if fav is one of "V PO" or "V T"
             cnames[FNR]=$1                                   # save client name
             strnus[FNR]=$2                                   # save strnu
          }
          next
        }
        { for (i in cnames) {                                 # loop through array indices
              cname=cnames[i]                                 # make copy of client name ...
              strnu=strnus[i]                                 # and strnu so that we can ...
              gsub(/-/,"",cname)                              # strip hypens from both ...
              gsub(/-/,"",strnu)                              # in order to perform the comparisons ...
              if (index($1,cname) && index($1,strnu)) {       # if cname and strnu both exist in $1 then index()>=1 in both cases so ...
                 print $1,cnames[i],strnus[i],$2,$3           # print to stdout
                 next                                         # we found a match so break from loop and go to next line of input
              }
          }
        }
' DATA1.csv DATA2.csv

This generates:

USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N