I am not sure how to phrase my question, which may be the problem, but Google is not helping me.
Let's talk about big files (the last one I had to deal with: 18 GB). Let's say you have files with millions of lines (the last one I had to deal with: 6 million). Let's say each line is more or less the same size (the last one I had to deal with: 30 to 40 K). Let's say you want a given line, so you grep or awk your file, but you know you can skip the first 3 million lines; in my case I would like to skip 8 GB of the file to be faster.
Is there a way?
CodePudding user response:
You can use dd like this:
# make a 10GB file of zeroes
dd if=/dev/zero bs=1G count=10 > file
# read it, skipping first 9GB and count what you get
dd if=file bs=1G skip=9 | wc -c
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.84402 s, 582 MB/s
1073741824
Note that I am just demonstrating the concept of how easily you can skip 9GB. In practice, rather than allocating a whole gigabyte of buffer, you might prefer a 100MB buffer and skip 90 of them:
dd if=file bs=100M skip=90 | wc -c
Note also that I am piping to wc rather than awk because my test data is not line oriented - it is just zeros.
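If you want to try the same thing on line-oriented data, here is a minimal sketch (assuming GNU coreutils; the line format and skip count are arbitrary):
# build 6 million short, numbered lines of test data
seq -f "record-%09.0f padding padding padding" 1 6000000 > file
# skip the first 100 blocks of 30K, then count the lines that remain
dd if=file bs=30K skip=100 2> /dev/null | wc -l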
Or, if your record size is 30kB and you want to skip a million records and discard diagnostic output:
dd if=file bs=30K skip=1000000 2> /dev/null | awk ...
Note that:
- your line numbers will be "wrong" in awk (because awk didn't "see" them), and
- your first line may be incomplete (because dd isn't "line oriented"), but I guess that doesn't matter.
Note also that it is generally very advantageous to use a large block size. So, if you want 8MB, you will do much better with bs=1M count=8 than with bs=8 count=1000000, which will cause a million writes of 8 bytes each.
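A quick way to see the difference for yourself (a sketch, assuming GNU dd and a writable scratch directory):
# both write roughly 8MB of zeroes; compare the elapsed time and throughput dd reports
dd if=/dev/zero of=/tmp/block-1M bs=1M count=8
dd if=/dev/zero of=/tmp/block-8B bs=8 count=1000000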
CodePudding user response:
If you know the full size of the file (let's say 5 million lines) you can do this:
tail -n 2000000 filename | grep "yourfilter"
This way you will do whatever editing or printing you want, starting after the first 3 million lines.
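If you do not know the line count up front, a small sketch (assuming GNU coreutils; filename and filter are placeholders):
# count the lines first (reads the whole file once) ...
total=$(wc -l < filename)
tail -n "$((total - 3000000))" filename | grep "yourfilter"
# ... or simply start tail at an absolute line number
tail -n +3000001 filename | grep "yourfilter"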
CodePudding user response:
I have not tested the performance on very large files compared to tail | grep, but you could try GNU sed:
sed -n '3000001,$ {/your regex/p}' file
This skips the first 3 million lines and then prints all lines matching the your regex regular expression. Same with awk:
awk 'NR>3000000 && /your regex/' file