Home > Back-end >  Bash script to compare and generate csv datafile
Bash script to compare and generate csv datafile

Time:11-18

I have two CSV files data1.csv and data2.csv the content is something like this (with headers) :

DATA1.csv

Client Name;strnu;addr;fav
MAD01;HDGF;11;V PO
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V ZZ

DATA2.csv

USER;BINin;TYPE
XXMAD01XXXHDGFXX;11;N
KJDGD;635;M
CVOJF01XXHHD;635;N

Issues :

  • The value of the 1st and 2nd column of DATA1.csv exist randomly in the first column of DATA2.csv. For example MAD01;HDGF exist in the first column of DATA2 ***MAD01***HDGF** (* can be alphanum and/or symbols charachter) and MAD01;HDGF might not be in the same order in the column USER of DATA2.
  • The value of strnum in DATA1 is equal to the value of the column BINin in DATA2
  • The column fav DATA1 is the same as TYPE in DATA2 because V T = M and V PO = N (some other valuses may exist but we won't need them for example line 3 of DATA1 it should be ignored)

N.B: some data may exist in a file but not the other.

my bash script needs to generate a new CSV file that should contain:

  • The column USER from DATA2
  • Client Name and strnu from DATA1
  • BINin from DATA2 only if it's equal to the corespondent line and value of strnu DATA1
  • TYPE using M N Format and making sure to respect the condition that V T = M and V PO = N

The first thing i tried was usuing grep to search for lines that exist in both files

#!/bin/sh

DATA1="${1}"
DATA2="${2}"

for i in $(cat $DATA1 | awk -F";" '{print $1".*"$2}' | sed 1d) ; do
   grep "$i" $DATA2
done

Result :

$ ./script.sh DATA1.csv DATA2.csv
MAD01;HDGF;11;V PO
XXMAD01XXXHDGFXX;11;N
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V PO

Using grep and awk i could find lines that are present in DATA1 and DATA2 files but it doesn't work for all the lines and i guess it's because of the - and other special characters present in column 2 of DATA1 but they can be ignored.

I don't know how i can generate a new csv that would mix the lines present in both files but the expected generated CSV should look like this

USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;M

CodePudding user response:

This can be done in a single awk program. This is join.awk

BEGIN {
    FS = OFS = ";"
    print "USER", "Client Name", "strnu", "BINin", "TYPE"
}

FNR == 1 {next}

NR == FNR {
    strnu[$1] = $2
    next
}

{
    for (client in strnu) {
        strnu_pattern = strnu[client]
        gsub(/-/, "", strnu_pattern)

        if ($1 ~ client && $1 ~ strnu_pattern) {
            print $1, client, strnu[client], $2, $3
            break
        }
    }
}

and then

awk -f join.awk DATA1.csv DATA2.csv

outputs

USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N

CodePudding user response:

Assumptions/understandings:

  • ignore lines from DATA1.csv where the fav field is not one of V T or V PO
  • when matching fields we need to ignore the any hyphens from the DATA1.csv fields
  • when matching fields the strings from DATA1.csv can show up in either order in DATA2.csv
  • last line of the expected output show end with 635,N

One `awk idea:

awk '
BEGIN   { FS=OFS=";"
          print "USER","Client Name","strnu","BINin","TYPE"   # print new header
        }
FNR==1  { next }                                              # skip input headers
FNR==NR { if ($4 == "V PO" || $4 == "V T") {                  # only process if fav is one of "V PO" or "V T"
             cnames[FNR]=$1                                   # save client name
             strnus[FNR]=$2                                   # save strnu
          }
          next
        }
        { for (i in cnames) {                                 # loop through array indices
              cname=cnames[i]                                 # make copy of client name ...
              strnu=strnus[i]                                 # and strnu so that we can ...
              gsub(/-/,"",cname)                              # strip hypens from both ...
              gsub(/-/,"",strnu)                              # in order to perform the comparisons ...
              if (index($1,cname) && index($1,strnu)) {       # if cname and strnu both exist in $1 then index()>=1 in both cases so ...
                 print $1,cnames[i],strnus[i],$2,$3           # print to stdout
                 next                                         # we found a match so break from loop and go to next line of input
              }
          }
        }
' DATA1.csv DATA2.csv

This generates:

USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N
  • Related