Splitting up a big file using AWK: cannot get past 252 split files

Time:01-04

I want to split a large file (7.5MB) into multiple smaller files based on a regex timestamp pattern. There are 566 timestamps in the file.

The large file is made up of multiple blocks of data. Each block consists of a timestamp followed by its data, and the file looks like this (line 1 is the first timestamp):

12/20/2022 23:18:56

blah
blah
blah
blah
blah
blah
12/20/2022 23:23:56

blah
blah
blah
12/20/2022 23:28:56
blah
...
...
...

Each smaller, split-up file should only contain one timestamp & one block of data, e.g.:

12/20/2022 23:23:56

blah
blah
blah

I'm using awk to look for each timestamp. Once a timestamp is found, it and the data that follow are written to a split file until the next timestamp is found, which starts the next split file:

regex='([0-9]{2}\/[0-9]{2}\/[0-9]{4})'
awk -v regex=$regex '$0 ~ regex{x="split" ++i}; i > 0 {print > x;}' $bigfile

This works great (i.e. files split1-252 are exactly what I expected) until awk encounters the 253rd occurrence of the timestamp, and then it errors out:

awk: can't open file split253
 source line number 1

As far as I can tell, there's nothing different about the 253rd timestamp. So I saved the 253rd through 566th timestamp occurrences as a new file, which contains a total of 314 occurrences of the timestamp pattern, and reran my code against it. Interestingly enough, awk errored out again with the exact same message:

awk: can't open file split253
 source line number 1

It almost seems that the way I have written the awk command can only handle creating 252 files based on the regex pattern, but I'm not sure what's causing this limitation. Any advice would be greatly appreciated.

I've been researching/googling this for a couple of days now and found another post with a similar issue. I tried setting an initial value for x, but that still gave me the same error. Furthermore, if an initial value for x were needed, I'd expect AWK to error out immediately rather than work correctly for split1-252 and then fail at 253.

CodePudding user response:

There's always a limit to how many files one process can have open, and different awk versions also have their own limits, which can be as low as 10. Some awks (e.g. GNU awk) work around the external limit internally by closing and reopening files behind the scenes, which slows them down, while other awks just fail, as you saw. The fix is to close each output file before opening the next:

regex='[0-9]{2}/[0-9]{2}/[0-9]{4}'
awk -v regex="$regex" '$0 ~ regex{close(x); x="split"(++i)}; i{print > x}' "$bigfile"

I tidied up your regexp, quoting, etc. too. Obviously you don't actually need to declare a shell variable to hold the regexp:

awk '/[0-9]{2}\/[0-9]{2}\/[0-9]{4}/{close(x); x="split"(++i)}; i{print > x}' "$bigfile"
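If you want to check the close() approach before running it on the real data, here is a rough, self-contained sketch. Everything in it is made up for the test: the directory name split_demo, the input file big.txt, the variable names prefix and out, and the count of 300 blocks (chosen only because it is past the point where the original command failed). The date pattern is spelled out digit by digit instead of using {2} so that it should also work in awks that lack interval expressions; that is a precaution, not a required change.

```shell
#!/bin/sh
# Hypothetical scratch directory for the demonstration.
dir=split_demo
mkdir -p "$dir"

# Build a made-up input: 300 blocks, each one timestamp line plus two data lines.
awk 'BEGIN {
    for (n = 1; n <= 300; n++) {
        printf "12/20/2022 23:%02d:%02d\n", int(n / 60), n % 60
        print "blah " n
        print "blah " n
    }
}' > "$dir/big.txt"

# Show the per-process open-file limit of the current shell.
echo "open file limit: $(ulimit -n)"

# Split: close the previous output file before starting the next one,
# so only one output file is open at any time.
awk -v prefix="$dir/split" '
    /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] / { close(out); out = prefix (++i) }
    i { print > out }
' "$dir/big.txt"

# Count the pieces that were written.
ls "$dir"/split* | wc -l
```

Because each file is closed before the next one is opened, the number of blocks no longer matters, so the same command should cope with 566 timestamps as readily as with 300.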

CodePudding user response:

You can also use csplit:

csplit -qz ip.txt '/[0-9]\{2\}\/[0-9]\{2\}\/[0-9]\{4\}/' '{*}'

This will create files named xx00, xx01, xx02, etc. You can customize the output names. For example, -n1 -f'split' will give names like split0, split1, etc.
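As a small sketch of the custom naming, assuming GNU csplit (the -q and -z options are GNU extensions and may be missing from other implementations): the directory csplit_demo and the three-block sample input below are made up for illustration.

```shell
#!/bin/sh
# Hypothetical scratch directory and a tiny three-block sample input.
dir=csplit_demo
mkdir -p "$dir"
printf '%s\n' \
    '12/20/2022 23:18:56' 'blah' \
    '12/20/2022 23:23:56' 'blah' 'blah' \
    '12/20/2022 23:28:56' 'blah' > "$dir/ip.txt"

# -q: do not print the size of each piece
# -z: drop empty pieces (the empty one before the first timestamp)
# -n1: use at least one digit in the numeric suffix
# -f:  prefix for the output names, here giving split0, split1, ...
csplit -qz -n1 -f "$dir/split" "$dir/ip.txt" \
    '/[0-9]\{2\}\/[0-9]\{2\}\/[0-9]\{4\}/' '{*}'

# List the resulting pieces.
ls "$dir"
```

Note that the numbering starts at 0 here, whereas the awk version starts at split1.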
