I have a text file with 7 tab-delimited columns. Each column has a different number of rows, and the values within a column may be duplicated. I want to remove the duplicates so that each column contains only the unique values for that specific column. As an example:
Input
C1    C2    C3    C4    C5    C6    C7
111   111   222   333   111   222   777
222   111   333   333   222   333   666
222   111   444   111   333   555   555
333   444   555   222   444   666   444
444   666   555   777   555   666   333
444   777   777   555   666   888   333
777   888   999   666   888
999
Output
C1    C2    C3    C4    C5    C6    C7
111   111   222   333   111   222   777
222   444   333   111   222   333   666
333   666   444   222   333   555   555
444   777   555   777   444   666   444
777   888   777   555   555   888   333
999         999   666   666
                        888
I figure I would need to use awk to print each column, run each one through sort -u separately, and then paste those outputs back together. So, is there a way to make a loop that, for each of the i columns in a text file, would print that column | sort -u, and then paste it all together?
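Something along these lines is what I'm picturing (untested sketch; note that sort -u would also re-order each column, so the sketch uses awk '!seen[$0]++' instead to keep first-occurrence order, which is what my expected output above shows):

ncols=$(awk -F'\t' '{ print NF; exit }' input.txt)   # column count taken from the first line
for ((i=1; i<=ncols; i++)); do
    # pull out column i, drop empty cells, keep only the first occurrence of each value
    awk -F'\t' -v c="$i" '$c != "" { print $c }' input.txt |
        awk '!seen[$0]++' > "col$i.txt"
done
paste col*.txt   # paste joins with tabs by default; glob order is fine for single-digit column counts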
Thanks in advance, Carlos
CodePudding user response:
Using perl instead for its support of true multidimensional arrays:
perl -lane '
    for my $n (0..$#F) {
        if (!exists ${$vals[$n]}{$F[$n]}) {   # first time this value appears in column $n
            push @{$cols[$n]}, $F[$n];
            ${$vals[$n]}{$F[$n]} = 1;
        }
    }
    END {
        # the longest de-duplicated column determines how many rows to print
        my $rows = 0;
        for my $c (@cols) { $rows = @$c if @$c > $rows }
        for (1..$rows) {
            my @row;
            for my $n (0..$#cols) {
                push @row, shift(@{$cols[$n]}) // "";   # pad exhausted columns with ""
            }
            print join("\t", @row);
        }
    }' input.txt
CodePudding user response:
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (colNr=1; colNr<=NF; colNr++) {
        val = $colNr
        if ( !seen[colNr,val]++ ) {                 # first occurrence of val in this column
            rowNr = ++colRowNrs[colNr]              # next free row slot for this column
            vals[rowNr,colNr] = val
            numRows = (rowNr > numRows ? rowNr : numRows)
        }
    }
    numCols = (NF > numCols ? NF : numCols)
}
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            val = vals[rowNr,colNr]
            printf "%s%s", val, (colNr<numCols ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
C1    C2    C3    C4    C5    C6    C7
111   111   222   333   111   222   777
222   444   333   111   222   333   666
333   666   444   222   333   555   555
444   777   555   777   444   666   444
777   888   777   555   555   888   333
999         999   666   666
                        888
CodePudding user response:
Assumptions:

- an awk array of the entire output result will fit in memory
- variable number of columns and rows
One idea consists of a (sparse) 2-dimensional array of values, where the array structure would look like:

values[<column#>][<row#>] = <unique_cell_value>

The following single awk invocation a) requires just one pass through the input file and b) does not require any transposing/pasting (in case anyone takes Cyrus' comment/suggestion seriously):
awk '
BEGIN { FS=OFS="\t" }
{ maxNF = (NF > maxNF ? NF : maxNF)         # keep track of max number of columns
  for (i=1; i<=NF; i++) {
      if ( $i == "" )                       # ignore empty cell
         continue
      for (j=1; j<=ndx[i]; j++) {           # loop through values already seen for this column
          if ( $i == vals[i][j] ) {         # and if already seen then
             $i = ""                        # clear the current cell and
             break                          # break out of this for/testing loop
          }
      }
      if ( $i != "" ) {                     # if we got this far and the cell is not empty then
         vals[i][++ndx[i]] = $i             # store the new value in our array
         maxRows = (ndx[i] > maxRows ? ndx[i] : maxRows)   # track the longest column
      }
  }
}
END { for (j=1; j<=maxRows; j++) {          # loop through all output rows
          pfx = ""
          for (i=1; i<=maxNF; i++) {        # loop through all output columns
              printf "%s%s", pfx, vals[i][j]   # non-existent array entries default to ""
              pfx = OFS
          }
          printf "\n"
      }
}
' input_file
NOTE: The array-of-arrays structure (arr[i][j]) requires GNU awk; otherwise we could convert to a pseudo dual-index array structure of arr[i,j].
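If GNU awk isn't available, here is a minimal, self-contained sketch of that pseudo dual-index form; awk joins the (i,j) subscripts into a single string key using SUBSEP, so this runs in any POSIX awk:

awk 'BEGIN {
    vals[1,1] = "a"; vals[1,2] = "b"; vals[2,1] = "c"
    for (i=1; i<=2; i++)
        for (j=1; j<=2; j++)
            if ( (i,j) in vals )            # membership test for a pseudo-2D key
                print i, j, vals[i,j]
}'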
This generates:
C1    C2    C3    C4    C5    C6    C7
111   111   222   333   111   222   777
222   444   333   111   222   333   666
333   666   444   222   333   555   555
444   777   555   777   444   666   444
777   888   777   555   555   888   333
999         999   666   666
                        888