I am trying to count the number of line matches in a very LARGE file and store them in variables using only bash shell commands.
Currently, I am scanning the large file twice and using a separate grep statement each time, like so:
$ cat test.txt
first example line one
first example line two
first example line three
second example line one
second example line two
$ FIRST=$( cat test.txt | grep 'first example' | wc --lines ; ) ; ## first run
$ SECOND=$(cat test.txt | grep 'second example' | wc --lines ; ) ; ## second run
and I end up with this:
$ echo $FIRST
3
$ echo $SECOND
2
I want to scan the large file only once. And I have never used awk and would rather not use it!
The | tee option is new to me. It seems that passing the results into two separate grep statements may mean that we only have to scan the large file once.
Ideally, I would also like to be able to do this without having to create any temporary files & subsequently having to remember to delete them.
I have tried multiple ways using something like these below:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST=$( grep 'first example' | wc --lines ;);) \
>(SECOND=$(grep 'second example' | wc --lines ;);) \
>/dev/null ;
and using read:
FIRST=''; SECOND='';
cat test.txt \
|tee >(grep 'first example' | wc --lines | (read FIRST); ); \
>(grep 'second example' | wc --lines | (read SECOND); ); \
> /dev/null ;
cat test.txt \
| tee <( read FIRST < <(grep 'first example' | wc --lines )) \
<( read SECOND < <(grep 'second example' | wc --lines )) \
> /dev/null ;
and with curly brackets:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST={$( grep 'first example' | wc --lines ;)} ) \
>(SECOND={$(grep 'second example' | wc --lines ;)} ) \
>/dev/null ;
but none of these allow me to save the line count into variables FIRST and SECOND.
Is this even possible to do?
CodePudding user response:
tee isn't saving any work. Each grep is still going to do a full scan of the file. Either way you've got three passes through the file: two greps and one Useless Use of Cat. In fact tee actually just adds a fourth program that loops over the whole file.
The various | tee invocations you tried don't work because of one fatal flaw: variable assignments don't work in pipelines. That is to say, they "work" insofar as a variable is assigned a value; it's just that the value is almost immediately lost. Why? Because the variable is in a subshell, not the parent shell.
Every command in a | pipeline executes in a different process, and it's a fundamental fact of Linux systems that processes are isolated from each other and don't share variable assignments.
As a rule of thumb, you can write variable=$(foo | bar | baz) where the variable is on the outside. No problem. But don't try foo | variable=$(bar) | baz where it's on the inside. It won't work and you'll be sad.
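A minimal sketch that demonstrates the subshell problem (safe to paste into an interactive bash session):

```shell
# Assignment on the outside of the pipeline: works fine.
count=$(printf 'a\nb\n' | wc -l)
echo "$count"    # prints 2

# Assignment inside the pipeline: the assignment happens in a
# subshell, so the parent shell never sees it.
inner=unset
printf 'a\nb\n' | { inner=$(wc -l); }
echo "$inner"    # still prints "unset"
```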
But don't lose hope! There are plenty of ways to skin this cat. Let's go through a few of them.
Two greps
Getting rid of cat yields:
first=$(grep 'first example' test.txt | wc -l)
second=$(grep 'second example' test.txt | wc -l)
This is actually pretty good and will usually be fast enough. Linux maintains a large page cache in RAM. Any time you read a file Linux stores the contents in memory. Reading a file multiple times will usually hit the cache and not the disk, which is super fast. Even multi-GB files will comfortably fit into modern computers' RAM, particularly if you're doing the reads back-to-back while the cached pages are still fresh.
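As a small aside (not in the original answer): grep can count matching lines itself via its standard -c flag, which drops the wc stage entirely:

```shell
# grep -c prints the number of matching lines, so no wc is needed.
# Assumes the same test.txt sample file shown in the question.
first=$(grep -c 'first example' test.txt)
second=$(grep -c 'second example' test.txt)
echo "$first $second"    # with the sample file: 3 2
```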
One grep
You could improve this by using a single grep call that searches for both strings. It could work if you don't actually need the individual counts but just want the total:
total=$(grep -e 'first example' -e 'second example' test.txt | wc -l)
Or if there are very few lines that match, you could use it to filter down the large file into a small set of matching lines, and then use the original greps to pull out the separate counts:
matches=$(grep -e 'first example' -e 'second example' test.txt)
first=$(grep 'first example' <<< "$matches" | wc -l)
second=$(grep 'second example' <<< "$matches" | wc -l)
Pure bash
You could also build a Bash-only solution that does a single pass and invokes no external programs. Forking processes is slow, so using only built-in commands like read and [[ can offer a nice speedup.
First, let's start with a while read loop to process the file line by line:
while IFS= read -r line; do
...
done < test.txt
You can count matches by using double square brackets [[ and string equality ==, which accepts * wildcards:
first=0
second=0
while IFS= read -r line; do
[[ $line == *'first example'* ]] && (( first++ ))
[[ $line == *'second example'* ]] && (( second++ ))
done < test.txt
echo "$first" ## should display 3
echo "$second" ## should display 2
Another language
If none of these are fast enough then you should consider using a "real" programming language like Python, Perl, or, really, whatever you are comfortable with. Bash is not a speed demon. I love it, and it's really underappreciated, but even I'll admit that high-performance data munging is not its wheelhouse.
CodePudding user response:
If you're going to be doing things like this, I'd really recommend getting familiar with awk; it's not scary, and IMO it's much easier to do complex things like this with it vs. the weird pipefitting you're looking at. Here's a simple awk program that'll count occurrences of both patterns at once:
awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt
Explanation: /first example/ {first++} means for each line that matches the regex pattern "first example", increment the first variable. /second example/ {second++} does the same for the second pattern. Then END {print first, second} means at the end, it should print the two variables. Simple.
But there is one tricky thing: splitting the two numbers it prints into two different variables. You could do this with read:
bothcounts=$(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
read first second <<<"$bothcounts"
(Note: I recommend using lower- or mixed-case variable names, to avoid conflicts with the many all-caps names that have special functions.)
Another option is to skip the bothcounts variable by using process substitution to feed the output from awk directly into read:
read first second < <(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
CodePudding user response:
">" is about redirecting to a file or device, not to the next command in a pipe. So tee just lets you redirect the pipe to multiple files, not to multiple commands. So just try this:
FIRST=$(grep 'first example' test.txt| wc --lines)
SECOND=$(grep 'second example' test.txt| wc --lines)
CodePudding user response:
It's possible to get the matches and count them in a single pass, then pull the count of each from the result.
matches="$(grep -e 'first example' -e 'second example' --only-matching test.txt | sort | uniq -c | tr -s ' ')"
FIRST=$(grep -e 'first example' <<<"$matches" | cut -d ' ' -f 2)
echo $FIRST
Result:
3
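The second count can be pulled from the same $matches variable the same way, keeping the file scan to a single pass:

```shell
# Reuses the $matches variable built by the single grep pass above.
SECOND=$(grep -e 'second example' <<<"$matches" | cut -d ' ' -f 2)
echo $SECOND    # prints 2 for the sample file
```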
Using awk is the best option, I think.