Processing a very big file with bash commands


I am not sure how to phrase my question, which may be part of the problem, but Google is not helping me.

Let's talk about big files (the last one I had to deal with was 18 GB). Let's say the file has millions of lines (the last one had 6 million), and that each line is more or less the same size (30 to 40 K in my case). Let's say you want a given line, so you grep or awk the file, but you know you can skip the first 3 million lines; in my case I would like to skip the first 8 GB of the file to be faster.

Is there a way?

CodePudding user response:

You can use dd like this:

# make a 10GB file of zeroes
dd if=/dev/zero bs=1G count=10 > file

# read it, skipping first 9GB and count what you get
dd if=file bs=1G skip=9 | wc -c
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.84402 s, 582 MB/s
1073741824

Note that I am just demonstrating the concept of how easily you can skip 9 GB. In practice, you may prefer to use a 100 MB memory buffer and skip 90 of them, rather than allocating a whole gigabyte. So you might prefer:

dd if=file bs=100M skip=90 | wc -c

Note also that I am piping to wc rather than awk because my test data is not line oriented - it is just zeros.

Or, if your record size is 30kB and you want to skip a million records and discard diagnostic output:

dd if=file bs=30K skip=1000000 2> /dev/null | awk ...

Note that:

  • your line numbers will be "wrong" in awk (because awk didn't "see" them), and
  • your first line may be incomplete (because dd isn't "line oriented"), but I guess that doesn't matter - and it is easy to drop, as sketched below.
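
If the partial first line is a problem, one easy (untested, hypothetical) fix is to throw it away before awk sees it, using the same placeholder file and filter as above:

# drop the first, probably partial, line with tail -n +2
# (if the byte offset lands exactly on a line boundary, this discards one whole line instead)
dd if=file bs=30K skip=1000000 2> /dev/null | tail -n +2 | awk '/your regex/'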

Note also that it is generally very advantageous to use a large block size. So, if you want 8 MB, you will do much better with bs=1M count=8 than with bs=8 count=1000000, which causes a million reads and writes of 8 bytes each.
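
As a rough, hypothetical illustration (timings will differ on your machine), you can see the effect of the block size by copying the same 8 MB from /dev/zero to /dev/null both ways and timing it:

# eight reads/writes of 1 MB each
time dd if=/dev/zero of=/dev/null bs=1M count=8

# a million reads/writes of 8 bytes each - much slower
time dd if=/dev/zero of=/dev/null bs=8 count=1000000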

CodePudding user response:

If you know the total size of the file (let's say 5 million lines), you can do this:

tail -n 2000000 filename | grep "yourfilter"

This way you can do whatever editing or printing you need, starting below the first 3 million lines.
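
If you do not know the total line count, tail also accepts a +N form of -n that starts printing at line N instead of counting from the end (filename and the filter are placeholders here):

tail -n +3000001 filename | grep "yourfilter"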

CodePudding user response:

I have not tested the performance on very large files compared to tail | grep, but you could try GNU sed:

sed -n '3000001,$ {/your regex/p}' file

This skips the first 3 million lines and then prints all lines matching the your regex regular expression. The same with awk:

awk 'NR>3000000 && /your regex/' file
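
Note that both commands still read every line of the file; they only avoid applying the regular expression to the first 3 million. A quick, hypothetical way to compare them (and the tail variant) on your own data is to time each one with its output discarded:

time sed -n '3000001,$ {/your regex/p}' file > /dev/null
time awk 'NR>3000000 && /your regex/' file > /dev/null
time tail -n +3000001 file | grep 'your regex' > /dev/null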