grep cannot find all occurrences of a substring in a very big file

Time:10-19

I am on Ubuntu, in a Bash shell, trying to use grep to find all occurrences of the substring engineBreakdown() inside a log file with a .tra extension, say my_log_16.tra, and save the results to a file, say results_16.txt

So I run

cat /path/to/my_log_16.tra | grep "engineBreakdown()" > results_16.txt

When I then run less results_16.txt, I see that some lines containing the substring were saved, but not all the lines I expected.

In fact, when I manually search my_log_16.tra for engineBreakdown(), I find other lines containing the substring that were not saved to results_16.txt. So it seems that my command saves only the first occurrences of the substring.

I think grep may be failing because my_log_16.tra is a very large file (about 100 MB).

If this is the cause, is there a more reliable way to find all occurrences of a substring in a very big file?

version of grep on my machine

grep --version
grep (GNU grep) 2.25
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

Example of lines from my_log_16.tra

lines correctly detected and saved into results_16.txt

[I 2022-10-16 07:26:35.449 Rservice:75] engineBreakdown()
[I 2022-10-16 07:26:35.846 Rservice:75] engineBreakdown()
[I 2022-10-16 07:26:35.848 Rservice:75] engineBreakdown()

a piece of the file where the substring appears, but it is not saved into results_16.txt

[I 2022-10-16 11:32:48.039 web:2064] 200 GET /static/ui-src/default/img/Customer.png?v=0.9702853857687699 (127.0.0.1) 10.49ms
[I 2022-10-16 11:32:49.778 Rservice:75] engineBreakdown()
[I 2022-10-16 11:32:50.122 websocketclient:62] Connection : url::ws://localhost:3333/ws
[I 2022-10-16 11:32:50.125 Rservice:75] engineBreakdown()
[I 2022-10-16 11:32:50.128 Rservice:75] engineBreakdown()
[I 2022-10-16 11:32:55.123 websocketclient:62] Connection : url::ws://localhost:3333/ws
[I 2022-10-16 11:32:55.128 Rservice:75] engineBreakdown()
[I 2022-10-16 11:32:55.134 Rservice:75] engineBreakdown()

another piece of the file where the substring appears, but it is not saved into results_16.txt

[I 2022-10-17 04:00:35.127 Rservice:75] engineBreakdown()
[I 2022-10-17 04:00:35.138 Rservice:75] engineBreakdown()
[I 2022-10-17 04:00:39.206 websocketclient:62] Connection : url::ws://127.0.0.1:9999/request
[I 2022-10-17 04:00:39.220 websocketclient:62] Connection : url::ws://127.0.0.1:9999/auxiliary
[I 2022-10-17 04:00:39.228 channels:75] _on_connection_error, host=127.0.0.1, port=9999
[I 2022-10-17 04:00:39.233 channels:82] _on_connection_close, host=127.0.0.1, port=9999
[I 2022-10-17 04:00:39.237 channels:75] _on_connection_error, host=127.0.0.1, port=9999
[I 2022-10-17 04:00:39.243 channels:82] _on_connection_close, host=127.0.0.1, port=9999
[I 2022-10-17 04:00:40.122 websocketclient:62] Connection : url::ws://localhost:3333/ws
[I 2022-10-17 04:00:40.128 Rservice:75] engineBreakdown()
[I 2022-10-17 04:00:40.133 Rservice:75] engineBreakdown()
[I 2022-10-17 04:00:44.206 websocketclient:62] Connection : url::ws://127.0.0.1:9999/request
[I 2022-10-17 04:00:44.221 websocketclient:62] Connection : url::ws://127.0.0.1:9999/auxiliary
[I 2022-10-17 04:00:44.227 channels:75] _on_connection_error, host=127.0.0.1, port=9999
[I 2022-10-17 04:00:44.232 channels:82] _on_connection_close, host=127.0.0.1, port=9999
[I 2022-10-17 04:00:44.234 channels:75] _on_connection_error, host=127.0.0.1, port=9999
[I 2022-10-17 04:00:44.237 channels:82] _on_connection_close, host=127.0.0.1, port=9999
[I 2022-10-17 04:00:45.122 websocketclient:62] Connection : url::ws://localhost:3333/ws
[I 2022-10-17 04:00:45.126 Rservice:75] engineBreakdown()
[I 2022-10-17 04:00:45.128 Rservice:75] engineBreakdown()

update 1

I also tried with

grep "engineBreakdown()" /path/to/my_log_16.tra > results_16.txt

but the result is the same.

update 2

As suggested, double quotes might not be enough to handle the parentheses properly, so I removed the parentheses from the search string and also tried single quotes instead of double quotes

grep "engineBreakdown" /path/to/my_log_16.tra > results_16.txt

grep 'engineBreakdown' /path/to/my_log_16.tra > results_16.txt

but the result is the same.

CodePudding user response:

You can check whether this awk approach helps.

Data

$ cat file
engineBreakdown()
engineBreakdown() engineBreakdown() engineBreakdown() engineBreakdown()
engineBreakdown()
$ awk -v var="engineBreakdown()" '
    $0~var{
      printf NR
      for(i=1;i<=NF;i++){
        if($i~var){x++}
      }
      print " # matches: " x
      x=0
  }' file
1 # matches: 1
2 # matches: 4
3 # matches: 1

To just print the matching lines (like grep), without counting occurrences per line, simply do

$ awk -v var="engineBreakdown()" '$0~var{ print }' file
engineBreakdown()
engineBreakdown() engineBreakdown() engineBreakdown() engineBreakdown()
engineBreakdown()
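
A note on the pattern used above: awk's ~ operator treats var as an extended regular expression, where parentheses are grouping operators rather than literal characters, so the match is not strictly on the text engineBreakdown(). The following is a minimal sketch, not part of the original answer, that counts literal occurrences with awk's index() function instead. The sample data and the names s, rest, n and out are made up for illustration.

```shell
# Hypothetical sample data; the second line has one occurrence without "()".
tmp=$(mktemp)
printf '%s\n' 'engineBreakdown()' 'engineBreakdown() engineBreakdown' 'other' > "$tmp"

# index() does a plain substring search, so "(" and ")" are matched literally.
out=$(awk -v s='engineBreakdown()' '
  index($0, s) {
    n = 0; rest = $0
    # Repeatedly cut the line after each literal occurrence and count it.
    while ((p = index(rest, s)) > 0) {
      n++
      rest = substr(rest, p + length(s))
    }
    print NR " # matches: " n
  }' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Unlike the field-based loop, this also counts occurrences that are not separated by whitespace.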

CodePudding user response:

It seems your grep command is behaving oddly (perhaps because you are using an old version with a bug that was fixed later).

Here's an alternative with sed:

sed -n '/engineBreakdown()/p' /path/to/my_log_16.tra > op.txt

I'd recommend updating your grep installation. ripgrep is another alternative.
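
One possible cause that the question does not confirm: GNU grep treats a file as binary when it contains NUL bytes (or bytes that are invalid in the current locale's encoding), and it may then report only that the binary file matches instead of printing every matching line. The -a (--text) option forces grep to process the input as text, and -F makes the pattern a fixed string rather than a regular expression. Below is a minimal sketch, assuming a stray NUL byte somewhere in the log; the sample file and the variable name count_text are made up for illustration.

```shell
# Hypothetical sample: two matching lines with a NUL byte between them.
tmp=$(mktemp)
printf 'engineBreakdown()\n\0\nengineBreakdown()\n' > "$tmp"

# -a: treat the file as text even if it contains binary data
# -F: match the pattern as a literal string
# -c: print the number of matching lines
count_text=$(grep -a -F -c 'engineBreakdown()' "$tmp")
echo "$count_text"
rm -f "$tmp"
```

For the file in the question, the equivalent would be grep -a -F 'engineBreakdown()' /path/to/my_log_16.tra > results_16.txt. If the last line of the existing results_16.txt reads like "Binary file ... matches", that would support this explanation.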
