Iterate over columns and keep (print) only the columns in which all values match

I would like to iterate over the columns in a file and print/retain a column only if all of its values are identical. The script would proceed from the 1st column to the 2nd, and so on, until a mismatch among a column's values (strings) is found; then the loop would break and only the columns with matching values (within a given column) would be printed.

Each column ($i) could be tested for duplicates with the code below, but I'm struggling to figure out how to put this together in a loop:

cut -f"$i" -d " " | sort -u>tmpf; if [ $(wc -l < tmpf) = "1" ];

Here is an example of the dataset I'm working with:

superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Anopheles species;annularis
superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Anopheles species;dirus
superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Anopheles species;dirus
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Anostraca family:Thamnocephalidae genus:Branchinella species;pinnata
superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Culex species;hayashii
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Diplostraca family:Daphniidae genus:Daphnia species;ambigua
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Diplostraca family:Daphniidae genus:Daphnia species;ambigua
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Diplostraca family:Daphniidae genus:Daphnia species;carinata

Iterating over the columns (separated by " "), the first two columns match across all rows, but the 3rd column (class) does not, so the loop would stop there and print only the first two fields, e.g.

superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda

Basically, I'd like to keep/print columns that have identical values, and not keep/print columns that have multiple (non-identical) values.

The script would start in column/field 1 and test if all values are the same (comparing strings): if yes (as is the case in example data), then move on to column 2. Test if all values are the same in column 2 (they are), so move on to column 3. Test if all values are the same in column 3 (they are not). So, stop loop/break, and only print previous columns that had identical values.

The idea is to iterate over the fields/columns in the file and print columns up to where there is a mismatch, with some placeholder code for the 'for' loop:

for ... do cut -f"$i" -d " " | sort -u>tmpf; if [ $(wc -l < tmpf) = "1" ]; then awk '{printf "%s ;", $0}' tmpf; else break; fi; done

Any help would be much appreciated!

CodePudding user response：

$ cat tst.awk
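# Pass 1, first line: remember its fields; every column starts as a candidate.
NR == 1 {
    lastCommon = split($0,firstVals)
    next
}
# Pass 1, remaining lines: shrink lastCommon to the column before the first mismatch.
NR == FNR {
    for (i=1; i<=lastCommon; i++) {
        if ($i != firstVals[i]) {
            lastCommon = i-1
            break
        }
    }
    next
}
# Pass 2: print only the leading columns that matched on every line.
{
    for (i=1; i<=lastCommon; i++) {
        printf "%s%s", $i, (i<lastCommon ? OFS : ORS)
    }
}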

$ awk -f tst.awk file file
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda

If your input is coming from a pipe, it cannot be read twice, so you need to read it into memory during the first pass before printing it in the second pass:

$ cat tst.awk
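# First line: remember its fields; every column starts as a candidate.
NR == 1 {
    lastCommon = split($0,firstVals)
}
# Every line: shrink lastCommon on a mismatch and store the line for later.
{
    for (i=1; i<=lastCommon; i++) {
        if ($i != firstVals[i]) {
            lastCommon = i-1
            break
        }
    }
    lines[NR] = $0
}
# After all input is read: print the common leading columns of each stored line.
END {
    for (j=1; j<=NR; j++) {
        split(lines[j],flds)
        for (i=1; i<=lastCommon; i++) {
            printf "%s%s", flds[i], (i<lastCommon ? OFS : ORS)
        }
    }
}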

$ cat file | awk -f tst.awk
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
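
For comparison, the cut/sort loop sketched in the question could be completed roughly as follows. This is only a sketch under assumptions: the function name common_prefix_cols and its variables are made up for illustration, and fields are assumed to be separated by exactly one space. It rereads the file once per column, so the awk scripts above will usually be the better choice for large inputs.

```shell
# Hypothetical helper: print the leading columns of a single-space-separated
# file whose values are identical on every line.
common_prefix_cols() {
    file=$1
    # The field count of the first line bounds the loop ($(( )) strips wc's padding).
    nf=$(( $(head -n 1 "$file" | wc -w) ))
    last=0
    i=1
    while [ "$i" -le "$nf" ]; do
        # Number of distinct values in column i.
        n=$(( $(cut -d ' ' -f "$i" "$file" | sort -u | wc -l) ))
        [ "$n" -eq 1 ] || break
        last=$i
        i=$((i + 1))
    done
    # Print only the columns before the first mismatch, if there are any.
    if [ "$last" -gt 0 ]; then
        cut -d ' ' -f "1-$last" "$file"
    fi
}
```

It would be called as `common_prefix_cols file`. Unlike the awk versions, it needs a real file rather than a pipe, because the input is read several times.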

CodePudding user response：

You can also do it in one pass. The thing to notice is that the lines you're printing are, by definition, identical, so you only need to keep one copy of the line and the line count:

$ awk 'NR==1 {split($0,h)} 
       NR>1  {for(i=1;!(i in d) && i<=NF;i++) if($i!=h[i]) d[i]} 
       END   {for(r=1;r<=NR;r++) 
                {for(i=1;!(i in d) && i<=NF;i++) printf "%s ",h[i]; 
                 print""}}' file

superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
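
A slightly different sketch of the same one-pass idea, under a few assumptions of my own: fields are whitespace-separated, the wrapper name common_prefix and the variables n, h and prefix are illustrative only, and nothing is printed when even the first column differs. It tracks just the index of the last common column, builds the shared prefix once, and prints it once per input line without a trailing space:

```shell
# Hypothetical wrapper so the awk program can be reused on files or stdin.
common_prefix() {
    awk '
    NR == 1 { n = split($0, h) }          # every column starts as a candidate
    NR > 1  {
        for (i = 1; i <= n; i++)
            if ($i != h[i]) {             # first mismatch on this line
                n = i - 1                 # keep only the columns before it
                break
            }
    }
    END {
        # Join the surviving fields of the first line with OFS.
        for (i = 1; i <= n; i++)
            prefix = prefix (i > 1 ? OFS : "") h[i]
        # All output lines are identical, so print the prefix once per input line.
        if (n > 0)
            for (r = 1; r <= NR; r++)
                print prefix
    }' "$@"
}
```

It can be used as `common_prefix file` or at the end of a pipeline, e.g. `cat file | common_prefix`, since only the first line's fields are kept in memory.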