Save columns of data from a text file as separate files


I am looking for a way to take a text file that has columns of data separated by tab spaces:

file.txt

abcd    abcd    abcd
efgh    efgh    efgh
ijkl    ijkl    ijkl
mnop    mnop    mnop
qrst    qrst    qrst

And using awk I would like to save each column of data as its own text file, using numbers as file names.

The problem is that there is no way to predict how many columns the text will contain.

The only thing I know is that the columns will be separated by tab spaces.

such that:

awk '{ print $1 }' file  

will print column one

and:

awk '{ print $2 }' file

will print column two

However, I am looking to save each column as its own file.

The number of columns could be anything, possibly into the hundreds.

CodePudding user response:

all output files open at the same time

One GNU awk idea:

awk '{for (i=1;i<=NF;i++) print $i > (i".out")}' file

NOTES:

  • this will open, and keep open, a file descriptor for each output file
  • many awk implementations have a limit on the number of files they can have open at a time; opening and closing files is time consuming, so from a performance perspective you will want to limit the number of open and close operations
  • GNU awk has a fairly high limit on how many files can be opened at one time
  • if you have GNU awk and you receive an error stating something about too many open file descriptors then let us know and we can look at another idea (eg: running a separate awk for each set of N columns; using an in-memory solution - assuming the whole file can fit in memory)
  • you mention columns are separated by "tab spaces" (not sure what you mean ... columns separated by multiple tabs and spaces? columns separated by tabs or spaces?); this answer uses awk's default field delimiter of whitespace (runs of spaces/tabs treated as a single delimiter); if your fields are delimited by tabs but may contain spaces within the fields, then change awk '{for ... to awk -F'\t' '{for ...
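A quick way to see the delimiter difference described in the last note (small demo; the sample line and file name are made up for illustration):

```shell
# With awk's default FS, runs of spaces/tabs collapse into one separator;
# with -F'\t', a field may itself contain spaces.
printf 'a b\tc d\n' > fs_demo.txt

awk '{ print NF }'        fs_demo.txt   # default FS  -> 4 fields
awk -F'\t' '{ print NF }' fs_demo.txt   # tab-only FS -> 2 fields
```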

in-memory; one output file open at a time; vanilla awk

Assuming the input file can fit into memory:

One idea that should work for all awk flavors:

awk '
    { for (i=1;i<=NF;i++)
          cols[i]=cols[i] (FNR==1 ? "" : ORS) $i
    }
END { for (i=1;i<=NF;i++) {
          outfile= i ".out"
          print cols[i] > outfile
          close(outfile)
      }
    }
' file

in-memory; one output file open at a time; GNU awk

Another in-memory solution using GNU awk (for multi-dimensional array support):

awk '
    { for (i=1;i<=NF;i++)
         cols[i][FNR] = $i
    }
END { for (i=1;i<=NF;i++) {
          outfile= i ".out"
          for (j=1;j<=FNR;j++)
              print cols[i][j] > outfile
          close(outfile)
      }
    }
' file

All 3 of these answers generate:

$ head ?.out
==> 1.out <==
abcd
efgh
ijkl
mnop
qrst

==> 2.out <==
abcd
efgh
ijkl
mnop
qrst

==> 3.out <==
abcd
efgh
ijkl
mnop
qrst

Performance review

Setup:

# create a file with 5000 lines and 500 columns; ~19.5 MBytes

awk '
BEGIN { for (i=1;i<=5000;i++) {
            printf "%s", "col_1"
            for (j=2;j<=500;j++)
                printf "\t%s", "col_" j
            print ""
        }
      }
' > multi_column.txt

2.5 million open/close operations

Running either of the two answers that open and close each of the 500 output files for each of the 5000 input lines (ie, 5000 x 500 = 2.5 million open/close operations):

  • killed after 2 minutes and 800 lines processed
  • extrapolating: ~12.5 minutes to process 5000 lines
  • time will (obviously) vary depending on hardware (eg, Ed Morton reports his answer takes 10 minutes on his laptop)

all (500) output files open at the same time

Running the 1st answer (above):

  • 10 seconds to generate 500 files with 5000 lines each
  • even if we had to limit ourselves to, say, processing 20 columns at a time ... we could make 25 passes of the input file and still complete in < 7 minutes (time could be further reduced by running some of the awk sessions in parallel)
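A sketch of that multi-pass idea. The batch size and file names are illustrative, and the demo uses a tiny 3-column file with a batch size of 2 so it is self-contained; with the real 500-column file you would point it at multi_column.txt with batch=20:

```shell
# Process the input in batches of `batch` columns per awk pass; the
# `i<=NF` guard keeps the last (possibly short) batch safe.
printf 'a\tb\tc\nd\te\tf\n' > batch_demo.txt

cols=3     # total column count (assumed known, e.g. from: awk '{print NF; exit}')
batch=2    # columns handled per pass

lo=1
while [ "$lo" -le "$cols" ]; do
    hi=$(( lo + batch - 1 ))
    awk -F'\t' -v lo="$lo" -v hi="$hi" '{
        for (i=lo; i<=hi && i<=NF; i++)
            print $i > (i ".out")
    }' batch_demo.txt
    lo=$(( lo + batch ))
done
```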

in-memory; one output file open at a time; vanilla awk

Running the 2nd answer (above)

  • 6 seconds to generate 500 files with 5000 lines each

in-memory; one output file open at a time; GNU awk

Running the 3rd answer (above):

  • 3 seconds to generate 500 files with 5000 lines each
  • previous in-memory answer is slower due to the time required to 'find and append' the new field onto the end of the ever-increasing-in-length array entry (cols[i]=cols[i] (FNR==1 ? "" : ORS) $i)

CodePudding user response:

Portably with all awks no matter how many columns your input has:

awk -F'\t' '{
    for (i=1; i<=NF; i++) {
        out = i ".out"
        if ( !seen[out]++ ) {
            printf "" > out
        }
        print $i >> out
        close(out)
    }
}' file
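The printf "" > out guard is what makes re-runs safe: the first time a file name is seen it truncates the file, so output from a previous run is not appended to. A small demo (the stale file is fabricated for illustration):

```shell
# Simulate a leftover output file from an earlier run, then re-run the
# same truncate-on-first-sight logic and confirm the stale data is gone.
printf 'x\ny\n' > 1.out                       # pretend leftover file

printf 'a\tb\n' | awk -F'\t' '{
    for (i=1; i<=NF; i++) {
        out = i ".out"
        if ( !seen[out]++ ) printf "" > out   # truncate once per file name
        print $i >> out
        close(out)
    }
}'

cat 1.out    # only "a" -- the stale x/y lines are gone
```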

CodePudding user response:

You could do this:

awk 'NR==FNR{max = NF>max ? NF : max; next}
{for(i=1; i<=max; i++) {
    fn=sprintf("%s.col", i)
    print $i >> fn
    close(fn)
    }
}' file file

If your column widths are uniform, you can do one pass:

awk 'FNR==1{max=NF}
{for(i=1; i<=max; i++) {
    fn=sprintf("%s.col", i)
    print $i >> fn
    close(fn)
    }
}' file

Either creates these files with your example:

$ head *.col
==> 1.col <==
abcd
efgh
ijkl
mnop
qrst

==> 2.col <==
abcd
efgh
ijkl
mnop
qrst

==> 3.col <==
abcd
efgh
ijkl
mnop
qrst
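One thing the two-pass variant buys you on ragged input (rows with differing column counts): every output file still gets one line per input row, with missing trailing fields written as empty lines. A small check (the ragged sample is made up):

```shell
# Row 2 has only 2 fields, so 3.col gets an empty second line rather
# than ending up one line short.
printf 'a\tb\tc\nd\te\n' > ragged.txt

awk 'NR==FNR{max = NF>max ? NF : max; next}
{for(i=1; i<=max; i++) {
    fn=sprintf("%s.col", i)
    print $i >> fn
    close(fn)
    }
}' ragged.txt ragged.txt

wc -l 3.col    # 2 lines: "c" plus an empty line
```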

CodePudding user response:

Using an array to avoid a separate write operation for every single line (GNU awk, for multi-dimensional array support):

awk '
        {
                for(i=1; i<=NF; i++){
                  # saving columns in multi-dimensional array (GNU awk)
                  # i = column, NR = line
                  a[i][NR] = $i
                }
        }
        END{
                # iterating through array
                for (col in a){
                 joined = ""
                 # joining lines per column, in line order
                 # (a numeric loop, since "for (line in ...)" traversal
                 # order is not guaranteed)
                 for (line=1; line<=NR; line++){
                  joined = joined a[col][line] "\n"
                 }
                 gsub(/\n$/,"",joined)
                 # write all joined lines per column at once to col-file
                 print joined > (col".out")
                }
        }
' file.txt