Home > Enterprise >  Remove non matching values in csv column
Remove non matching values in csv column

Time:11-10

I need to validate and clean a field in CSV. There is column for IP address and I need to remove only invalid data inside that column.

I tried the following command :

awk 'BEGIN{ FS=OFS="," }{ gsub(/^([0-9]{1,3}[\.]){3}[0-9]{1,3}$/,"", $3) }1' input.csv

Input file

anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa

Current output

anna,new york,,usa
james,denver,,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa

Expected output

anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa

This command remove the matching data, but I need the opposite. How do I remove only the non-matching data in the IP column ?

CodePudding user response:

How do I remove only the non-matching data in the IP column ?

You might combine following string functions: match substr for this task following way

anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa

then

awk 'BEGIN{FS=OFS=","}{$3=match($3,/([0-9]{1,3}[\.]){3}[0-9]{1,3}/)?substr($3,RSTART,RLENGTH):"";print}' file.txt

gives output

anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa

Explanation: I inform GNU AWK that , is both field separator (FS) and output field separator (OFS), then for each line I use so called ternary operator condition?valueiftrue:valueiffalse, condition is if $3 does match regular expression, observe that I altered it slightly, so it does hold if IP is somewhere inside, rather than span whole column. If match found I use substr to get substring which does correspond to match using RSTART, RLENGTH which were set by match, otherwise I use empty string. After that I print whole line.

(tested in gawk 4.2.1)

CodePudding user response:

Using sed (if it is a valid option)

$ sed -E 's/(([^,]*,){2}[0-9.]*)[[:alpha:] ] ,/\1,/' input_file
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa

CodePudding user response:

If your CSV is as simple as what you show (one line per record, no commas inside fields, no quoted fields, no leading or trailing spaces in fields...), and after removing the male in 10.2.8.3 male (is it a typo?), you could try:

$ awk -F, -v OFS=, '$3 !~ /^([0-9]{1,3}\.){3}[0-9]{1,3}$/ {$3 = ""} {print}' input.csv
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa

And if you want to check that the 3rd field is really a valid full IP address (no subnets):

$ cat filter.awk
function isIP(v) {
  if(v !~ /^([0-9]{1,3}\.){3}[0-9]{1,3}$/)
    return 0;
  split(v, a, /\./)
  for(i = 1; i <= 4 ; i  ) {
    if(a[i] > 255) {
      return 0;
    }
  }
  return 1
}
BEGIN { FS = ","; OFS = "," }
! isIP($3) {$3 = ""}
{print}

$ cat input.csv 
bob,LA,292.168.1.5,usa
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3,usa

$ awk -f filter.awk input.csv 
bob,LA,,usa
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
  • Related