I have a file with 500,000 columns and I would like to split it into 50 files containing 10,000 columns each. Ideally I want a command like split, but one that cuts by columns rather than lines.
I have tried using cut:
cut -d ' ' -f1-10000 file.txt
However, repeating this 50 times is impractical, and since the file is so big each pass takes a long time, so I would like to read the file only once.
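For reference, the repeated-cut approach being ruled out would look something like the sketch below, shown at toy scale (the 6-column file.txt, the cols_per_file/total_cols sizes, and the part*.txt names are made up for illustration; the real case would be 500,000 columns split 10,000 per file):

```shell
# Toy-scale sketch of the repeated-cut approach.
# Each iteration re-reads the whole file, which is why 50 passes are so slow.
printf '1 2 3 4 5 6\n7 8 9 10 11 12\n' > file.txt
cols_per_file=3
total_cols=6
nfiles=$(( total_cols / cols_per_file ))
i=0
while [ "$i" -lt "$nfiles" ]; do
    start=$(( i * cols_per_file + 1 ))
    end=$(( start + cols_per_file - 1 ))
    cut -d ' ' -f"${start}-${end}" file.txt > "part$(( i + 1 )).txt"
    i=$(( i + 1 ))
done
```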
I have also tried awk, but I can only seem to split the file into single columns:
awk -F '[\t;]' '{for(i=1; i<=NF; i++) print $i >> "column" i ".txt"}' file.txt
Any ideas would be very much appreciated :)
CodePudding user response:
Play with this
awk -v le="10000" '{
  for (i=1; i<=NF; i+=le) {
    col = ""
    max = i + le
    file = i "-" (max-1) ".txt"
    for (j=i; j<max; j++) {
      col = col $(j) FS
    }
    print col > file
  }
}' input_file
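Run at toy scale it looks like this (a sketch: 6 columns with le=3; the 1-3.txt / 4-6.txt names follow the script's own i"-"(max-1)".txt" pattern). Note that each output line ends with a trailing separator, because FS is appended after every value:

```shell
printf '1 2 3 4 5 6\n10 20 30 40 50 60\n' > input_file
awk -v le="3" '{
  for (i=1; i<=NF; i+=le) {
    col = ""
    max = i + le
    file = i "-" (max-1) ".txt"   # e.g. "1-3.txt", "4-6.txt"
    for (j=i; j<max; j++)
      col = col $(j) FS           # note: leaves a trailing separator
    print col > file
  }
}' input_file
```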
CodePudding user response:
You could do it like this in awk, but it seems to be quite slow:
awk '
{
  for ( i=c=1; i<=50; i++ ) {
    o = $(c++)
    for ( n=1; n<10000; n++ ) o = o OFS $(c++)
    print o > "part" i ".txt"
  }
}
' file.txt
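The same idea at toy scale, with the hard-coded 50 and 10000 replaced by -v variables so it is easy to try (the sizes here, 3 files of n=2 columns, and the sample file.txt are made up for illustration):

```shell
printf '1 2 3 4 5 6\n7 8 9 10 11 12\n' > file.txt
awk -v files=3 -v n=2 '
{
  for ( i=c=1; i<=files; i++ ) {
    o = $(c++)                              # first column of this group
    for ( m=1; m<n; m++ ) o = o OFS $(c++)  # append the rest of the group
    print o > ("part" i ".txt")
  }
}
' file.txt
```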
CodePudding user response:
I can only seem to split the file into single columns
If you are allowed to use commands other than awk and are OK with using tab as the separator, then you might give the paste command a try. For example, if you have files named col1.txt, col2.txt, col3.txt, col4.txt, col5.txt, each with a single column, you might do
paste col1.txt col2.txt col3.txt col4.txt col5.txt > col1_5.txt
to get an ensemble of said columns.
Any ideas would be very much appreciated :)
Change your definition of column. Consider the following GNU AWK example; let file.txt content be
1 2 3 4 5 6 7 8 9
10 20 30 40 50 60 70 80 90
100 200 300 400 500 600 700 800 900
then
awk 'BEGIN{OFS="---";FPAT="[^[:space:]]+([[:space:]][^[:space:]]+){2}"}{print $1,$2,$3}' file.txt
output
1 2 3---4 5 6---7 8 9
10 20 30---40 50 60---70 80 90
100 200 300---400 500 600---700 800 900
Explanation: for demonstration purposes I set the output field separator (OFS) to ---. Then I inform GNU AWK that it should treat the following as a column: one or more non-whitespace characters, followed by (whitespace plus one or more non-whitespace characters) repeated twice. This way each column holds three values (easily adjusted by changing {2} to the desired count minus 1), and splitting the file "into single columns" now gives you the groups you want. Disclaimer: this solution assumes the number of columns is evenly divisible.
(tested in gawk 4.2.1)
CodePudding user response:
One awk idea:
awk -v n="${n}" '                          # pass in bash variable "n": number of columns per output file
{ sfx=0                                    # initialize file suffix/counter
  for (i=1; i<=NF; i++) {                  # loop through all columns
      if (i % n == 1) {                    # at beginning of a new set of "n" columns ...
         sfx++                             # increment file suffix/counter
         pfx=""                            # initialize printf column delimiter
      }
      printf "%s%s", pfx, $i > ("outfile_" sfx ".txt")
      pfx=OFS                              # update printf column delimiter
  }
  for (i=1; i<=sfx; i++)                   # at end of input line: loop through the open output files and ...
      print "" > ("outfile_" i ".txt")     # append a linefeed to terminate each current output line
}
' file.txt
NOTES:
- need confirmation from OP on the field delimiter(s), at which point the answer will be updated accordingly
- OP mentions the creation of 50 files; this answer will hold all 50 files open while processing the input; GNU awk should have no problems maintaining 50 open file descriptors, but I don't know about other flavors of awk
Sample input file:
$ cat file.txt
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
With n=3:
$ head outfile_*.txt
==> outfile_1.txt <==
1 2 3
1 2 3
1 2 3
==> outfile_2.txt <==
4 5 6
4 5 6
4 5 6
==> outfile_3.txt <==
7 8 9
7 8 9
7 8 9
==> outfile_4.txt <==
10
10
10
With n=4:
$ head outfile_*.txt
==> outfile_1.txt <==
1 2 3 4
1 2 3 4
1 2 3 4
==> outfile_2.txt <==
5 6 7 8
5 6 7 8
5 6 7 8
==> outfile_3.txt <==
9 10
9 10
9 10