Can I split a large file into multiple files each containing x columns (in bash)?


I have a file with 500,000 columns and I would like to split it into 50 files of 10,000 columns each. Ideally, I'm after something like split, but a command that cuts by columns rather than by lines.

I have tried using cut:

cut -d ' ' -f1-10000 file.txt

However, repeating this 50 times with different column ranges is impractical, and because the file is so big each pass takes a long time, so I would like to read the file only once.
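
Spelled out, the repeated-cut approach would look roughly like the loop below (just a sketch of what I'm trying to avoid, with placeholder part<N>.txt names; it still reads file.txt once per output file):

for i in $(seq 0 49); do
    start=$(( i * 10000 + 1 ))
    end=$(( start + 9999 ))
    cut -d ' ' -f"${start}-${end}" file.txt > "part$(( i + 1 )).txt"
done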

I have also tried awk, but I can only seem to split the file into single columns:

awk -F '[\t;]' '{for(i=1; i<=NF; i++) print $i >> "column" i ".txt"}' file.txt

Any ideas would be very much appreciated :)

CodePudding user response:

Play with this

awk -v le="10000" '{
    for (i=1; i<=NF; i+=le){            # step through the line, le columns at a time
      col=""
      max=i+le
      file=i "-" (max-1) ".txt"         # e.g. 1-10000.txt, 10001-20000.txt, ...
      for (j=i; j<max; j++){            # collect the le columns of this chunk
        col=col $(j) FS
      }
      print col > file
    }
}' input_file
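
A quick way to sanity-check it (a hedged example; demo.txt and the small chunk size are just placeholders) is to build a tiny input and run the same script with a small le:

printf '1 2 3 4 5 6 7 8 9\n' > demo.txt
# then run the awk command above with -v le="3" and demo.txt as the input file

This should leave files named 1-3.txt, 4-6.txt and 7-9.txt, each holding three columns per line; note that every output line also ends with a trailing field separator, since FS is appended after each value.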

CodePudding user response:

You could do it like this in awk but it seems to be quite slow:

awk '
    {
        for( i=c=1; i<=50; i++ ){                       # 50 output files per input line
            o = $(c++)                                  # first column of this chunk
            for( n=1; n<10000; n++ ) o = o OFS $(c++)   # append the remaining 9,999 columns
            print o > ("part" i ".txt")
        }
    }
' file.txt
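
If the counts need to change, one option (an untested variant of the script above, not part of the original answer) is to pass them in as awk variables instead of hard-coding 50 and 10000:

awk -v files=50 -v cols=10000 '
    {
        for( i=c=1; i<=files; i++ ){
            o = $(c++)
            for( n=1; n<cols; n++ ) o = o OFS $(c++)
            print o > ("part" i ".txt")
        }
    }
' file.txt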

CodePudding user response:

I can only seem to split the file into single columns

If you are allowed to use commands other than awk and are OK with tab as the separator, then you might give the paste command a try. For example, if you have files named col1.txt, col2.txt, col3.txt, col4.txt, col5.txt, each with a single column, you might do

paste col1.txt col2.txt col3.txt col4.txt col5.txt > col1_5.txt

to combine those columns side by side into a single file.
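
With many single-column files, bash brace expansion saves typing; for example, assuming the col<N>.txt naming above:

paste col{1..5}.txt > col1_5.txt

A plain glob like col*.txt would also work, but it sorts lexicographically (col10.txt before col2.txt), so brace expansion is safer when column order matters.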

Any ideas would be very much appreciated :)

Change your definition of a column. Consider the following GNU AWK example; let the content of file.txt be

1 2 3 4 5 6 7 8 9
10 20 30 40 50 60 70 80 90
100 200 300 400 500 600 700 800 900

then

awk 'BEGIN{OFS="---";FPAT="[^[:space:]]+([[:space:]][^[:space:]]+){2}"}{print $1,$2,$3}' file.txt

output

1 2 3---4 5 6---7 8 9
10 20 30---40 50 60---70 80 90
100 200 300---400 500 600---700 800 900

Explanation: for demonstration purposes I set the output field separator (OFS) to ---. Then, via FPAT, I tell GNU AWK to treat the following as a column: one or more non-whitespace characters, followed by (whitespace then one or more non-whitespace characters) repeated twice. This way each column holds three values (easily adjusted by changing {2} to the desired value minus 1), so "splitting the file into single columns" now splits it into groups of the original columns. Disclaimer: this solution assumes the number of columns is evenly divisible by the group size.

(tested in gawk 4.2.1)
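
To tie this back to the original question: once FPAT makes each "column" a group of the original fields, the per-field redirect from the question splits the file by groups. A minimal sketch (the group size of 3 and the part<N>.txt names are just examples, not from the answer above):

awk 'BEGIN{FPAT="[^[:space:]]+([[:space:]][^[:space:]]+){2}"}
     {for(i=1;i<=NF;i++) print $i > ("part" i ".txt")}' file.txt

The same disclaimer applies: the number of original columns must be evenly divisible by the group size.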

CodePudding user response:

One awk idea:

awk -v n="${n}" '                          # pass in bash variable "n" representing number of columns to place in a single file
    { sfx=0                                # initialize file suffix/counter

      for (i=1; i<=NF; i++) {              # loop through all columns
          if (i%n == 1) {                  # at beginning of new set of "n" columns ...
             sfx++                         # increment file suffix/counter
             pfx=""                        # initialize printf column delimiter
          }
          printf "%s%s", pfx, $i > ("outfile_" sfx ".txt")
          pfx=OFS                          # update printf column delimiter
      }
      for (i=1; i<=sfx; i++)               # at end of input line: loop through list of open files and ...
          print "" > ("outfile_" i ".txt") # append a linefeed at the end of each current output line
    }
' file.txt
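
The script reads the chunk size from the shell variable n, so set it before invoking the command above (a minimal bash usage sketch; 10000 matches the question, while n=3 and n=4 reproduce the sample runs below):

n=10000    # number of columns per output file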

NOTES:

  • Need confirmation from the OP on the actual field delimiter(s), at which point the answer will be updated accordingly.
  • The OP mentions the creation of 50 files; this answer holds all 50 files open while processing the input. GNU awk should have no problem maintaining 50 open file descriptors, but I don't know about other flavors of awk (see the close()-based sketch after these notes for a possible workaround).
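
For awk flavors with a low limit on simultaneously open files, one workaround (a rough sketch, not tested against the answer above; it assumes every line has the same number of columns and that leftover outfile_*.txt files may be truncated) is to append with >> and close() each chunk's file as soon as its piece of the line has been written, so at most one output file is open at a time:

awk -v n="${n}" '
    FNR==1 {                                  # truncate any leftover output files once
        nfiles = int((NF + n - 1) / n)
        for (i=1; i<=nfiles; i++) {
            f = "outfile_" i ".txt"
            printf "" > f
            close(f)
        }
    }
    {   sfx = 0
        for (i=1; i<=NF; i++) {
            if (i % n == 1) {                 # starting a new chunk of "n" columns ...
                if (sfx) {                    # ... so finish and close the previous chunk file
                    print "" >> fname
                    close(fname)
                }
                sfx++
                fname = "outfile_" sfx ".txt"
                pfx = ""
            }
            printf "%s%s", pfx, $i >> fname
            pfx = OFS
        }
        if (sfx) {                            # finish and close the last chunk of this line
            print "" >> fname
            close(fname)
        }
    }
' file.txt

The trade-off is speed: reopening and closing a file for every chunk of every line is noticeably slower than keeping all descriptors open.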

Sample input file:

$ cat file.txt
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10

With n=3:

$ head outfile_*.txt
==> outfile_1.txt <==
1 2 3
1 2 3
1 2 3

==> outfile_2.txt <==
4 5 6
4 5 6
4 5 6

==> outfile_3.txt <==
7 8 9
7 8 9
7 8 9

==> outfile_4.txt <==
10
10
10

With n=4:

$ head outfile_*.txt
==> outfile_1.txt <==
1 2 3 4
1 2 3 4
1 2 3 4

==> outfile_2.txt <==
5 6 7 8
5 6 7 8
5 6 7 8

==> outfile_3.txt <==
9 10
9 10
9 10