I am looking for a way to take a text file that has columns of data separated by tab spaces:
file.txt
abcd abcd abcd
efgh efgh efgh
ijkl ijkl ijkl
mnop mnop mnop
qrst qrst qrst
Using awk, I would like to save each column of data as its own text file, using numbers as file names. The problem is that there is no way of predicting the number of columns the file will contain; the only thing I know is that the columns will be separated by tab spaces. Such that:
awk '{ print $1 }' file
will print column one
and:
awk '{ print $2 }' file
will print column two
However, I am looking to save each column as its own file. The number of columns could be anything, up into the hundreds.
CodePudding user response:
all output files open at the same time
One GNU awk idea:
awk '{for (i=1;i<=NF;i++) print $i > (i ".out")}' file
NOTES:
- this will open, and keep open, a file descriptor for each output file
- many awk implementations have a limit on the number of files they can have open at a time; opening and closing files is time consuming, so from a performance perspective you will want to limit the number of open and close operations
- GNU awk has a fairly high limit on how many files can be open at one time; if you have GNU awk and you receive an error about too many open file descriptors then let us know and we can look at another idea (eg: running a separate awk for each set of N columns - see the sketch after these notes; or an in-memory solution, assuming the whole file can fit in memory)
- you mention columns are separated by tab spaces (not sure what you mean ... columns are separated by multiple tabs and spaces? columns are separated by tabs or spaces?); this answer uses awk's default field delimiter of 'white space' (multiple spaces/tabs treated as a single delimiter); if your fields are delimited by tabs, but include spaces within the fields, then change awk '{for ... to awk -F'\t' '{for ...
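If you do hit the open-file limit, here is a minimal sketch of that batching idea (the batch size of 20, the file.txt name, and the N.out naming are assumptions for illustration, not part of the answer above):

# determine the column count from the first line
ncols=$(awk -F'\t' 'NR==1 {print NF; exit}' file.txt)
batch=20
# one awk pass per batch of columns; each pass keeps at most "batch" files open
for (( first=1; first<=ncols; first+=batch )); do
    awk -F'\t' -v first="$first" -v last=$(( first + batch - 1 )) '
        { for (i=first; i<=last && i<=NF; i++) print $i > (i ".out") }
    ' file.txt
done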
in-memory; one output file open at a time; vanilla awk
Assuming the input file can fit into memory:
One idea that should work for all awk flavors:
awk '
{ for (i=1;i<=NF;i++)
cols[i]=cols[i] (FNR==1 ? "" : ORS) $i
}
END { for (i=1;i<=NF;i++) {
outfile = i ".out"
print cols[i] > outfile
close(outfile)
}
}
' file
in-memory; one output file open at a time; GNU awk
Another in-memory solution using GNU awk (for multi-dimensional array support):
awk '
{ for(i=1;i<=NF;i++)
cols[i][FNR] = $i
}
END { for (i=1;i<=NF;i++) {
outfile = i ".out"
for (j=1;j<=FNR;j++)
print cols[i][j] > outfile
close(outfile)
}
}
' file
All 3 of these answers generate:
$ head ?.out
==> 1.out <==
abcd
efgh
ijkl
mnop
qrst
==> 2.out <==
abcd
efgh
ijkl
mnop
qrst
==> 3.out <==
abcd
efgh
ijkl
mnop
qrst
Performance review
Setup:
# create a file with 5000 lines and 500 columns; ~19.5 MBytes
awk '
BEGIN { for (i=1;i<=5000;i++) {
printf "%s", "col_1"
for (j=2;j<=500;j++)
printf "\t%s", "col_" j
print ""
}
}
' > multi_column.txt
2.5 million open/close operations
Running either of the two answers (below) that open/close each of the 500 output files for each of the 5000 input lines (ie, 5000 x 500 = 2.5 million open/close operations):
- killed after 2 minutes and 800 lines processed
- extrapolating: ~12.5 minutes to process 5000 lines
- time will (obviously) vary depending on hardware (eg, Ed Morton reports his answer takes 10 minutes on his laptop)
all (500) output files open at the same time
Running the 1st answer (above):
- 10 seconds to generate 500 files with 5000 lines each
- even if we had to limit ourselves to, say, processing 20 columns at a time ... we could make 25 passes of the input file and still complete in < 7 minutes (time could be further reduced by running some of the awk sessions in parallel - see the sketch below)
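A hedged sketch of what those parallel passes could look like (the batch size and reuse of the multi_column.txt test file are assumptions):

ncols=500    # known from the setup above
batch=20
for (( first=1; first<=ncols; first+=batch )); do
    awk -v first="$first" -v last=$(( first + batch - 1 )) '
        { for (i=first; i<=last && i<=NF; i++) print $i > (i ".out") }
    ' multi_column.txt &    # run each batch in the background
done
wait                        # block until all awk sessions finish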
in-memory; one output file open at a time; vanilla awk
Running the 2nd answer (above)
- 6 seconds to generate 500 files with 5000 lines each
in-memory; one output file open at a time; GNU awk
Running the 3rd answer (above):
- 3 seconds to generate 500 files with 5000 lines each
- previous in-memory answer is slower due to the time required to 'find and append' the new field onto the end of the ever-increasing-in-length array entry (cols[i]=cols[i] (FNR==1 ? "" : ORS) $i)
CodePudding user response:
Portably with all awks no matter how many columns your input has:
awk -F'\t' '{
    for (i=1; i<=NF; i++) {
        out = i ".out"
        if ( !seen[out]++ ) {
            printf "" > out          # truncate the file the first time it is seen
        }
        print $i >> out              # then always append
        close(out)                   # at most one output file open at a time
    }
}' file
CodePudding user response:
You could do this:
awk 'NR==FNR{max = NF>max ? NF : max; next}   # 1st pass: find the maximum column count
     {for(i=1; i<=max; i++) {
        fn = sprintf("%s.col", i)
        print $i >> fn                        # append one field ...
        close(fn)                             # ... then close, so one file is open at a time
      }
     }' file file
If the number of columns is uniform across rows, you can do it in one pass:
awk 'FNR==1{max=NF}
{for(i=1; i<=max; i++) {
fn=sprintf("%s.col", i)
print $i >> fn
close(fn)
}
}' file
Either creates these files with your example:
$ head *.col
==> 1.col <==
abcd
efgh
ijkl
mnop
qrst
==> 2.col <==
abcd
efgh
ijkl
mnop
qrst
==> 3.col <==
abcd
efgh
ijkl
mnop
qrst
CodePudding user response:
Using an array to avoid repeating the write process for every single line:
awk '
{
for(i=1; i<=NF; i++){
# saving columns in a multi-dimensional array (requires GNU awk)
# i = column, NR = line
a[i][NR] = $i
}
}
END{
# iterating through array
for (col in a){
joined = ""
# joining lines per column
for (line=1; line<=NR; line++){   # numeric loop keeps lines in input order ("in" traversal order is undefined)
joined = joined a[col][line] "\n"
}
gsub(/\n$/,"",joined)
# write all joined lines per column at once to col-file
print joined > (col ".out")
}
}
' file.txt