I am trying to count the number of line matches in a very LARGE file and store them in variables using only bash shell commands.
Currently, I am scanning the large file twice and using a separate grep statement each time, like so:
$ cat test.txt
first example line one
first example line two
first example line three
second example line one
second example line two
$ FIRST=$( cat test.txt | grep 'first example' | wc --lines ; ) ; ## first run
$ SECOND=$(cat test.txt | grep 'second example' | wc --lines ; ) ; ## second run
and I end up with this:
$ echo $FIRST
3
$ echo $SECOND
2
I want to scan the large file only once. And I have never used awk and would rather not use it!
The | tee option is new to me. It seems that passing the results into two separate grep statements may mean that we only have to scan the large file once.
Ideally, I would also like to be able to do this without having to create any temporary files & subsequently having to remember to delete them.
I have tried multiple ways using something like these below:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST=$( grep 'first example' | wc --lines ;);) \
>(SECOND=$(grep 'second example' | wc --lines ;);) \
>/dev/null ;
and using read:
FIRST=''; SECOND='';
cat test.txt \
|tee >(grep 'first example' | wc --lines | (read FIRST); ); \
>(grep 'second example' | wc --lines | (read SECOND); ); \
> /dev/null ;
cat test.txt \
| tee <( read FIRST < <(grep 'first example' | wc --lines )) \
<( read SECOND < <(grep 'second example' | wc --lines )) \
> /dev/null ;
and with curly brackets:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST={$( grep 'first example' | wc --lines ;)} ) \
>(SECOND={$(grep 'second example' | wc --lines ;)} ) \
>/dev/null ;
but none of these allow me to save the line count into variables FIRST and SECOND.
Is this even possible to do?
CodePudding user response:
tee isn't saving any work. Each grep is still going to do a full scan of the file. Either way you've got three passes through the file: two greps and one Useless Use of Cat. In fact tee actually just adds a fourth program that loops over the whole file.
The various | tee invocations you tried don't work because of one fatal flaw: variable assignments don't work in pipelines. That is to say, they "work" insofar as a variable is assigned a value; it's just that the value is almost immediately lost. Why? Because the variable is in a subshell, not the parent shell.
Every command in a | pipeline executes in a different process, and it's a fundamental fact of Linux systems that processes are isolated from each other and don't share variable assignments.
As a rule of thumb, you can write variable=$(foo | bar | baz) where the variable is on the outside. No problem. But don't try foo | variable=$(bar) | baz where it's on the inside. It won't work and you'll be sad.
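A minimal sketch that demonstrates the subshell problem (safe to paste into an interactive bash session):

```shell
# Assignment on the outside of the pipeline: works fine.
count=$(printf 'a\nb\n' | wc -l)
echo "$count"    # prints 2

# Assignment inside the pipeline: the assignment happens in a
# subshell, so the parent shell never sees it.
inner=unset
printf 'a\nb\n' | { inner=$(wc -l); }
echo "$inner"    # still prints "unset"
```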
But don't lose hope! There are plenty of ways to skin this cat. Let's go through a few of them.
Two greps
Getting rid of cat yields:
first=$(grep 'first example' test.txt | wc -l)
second=$(grep 'second example' test.txt | wc -l)
This is actually pretty good and will usually be fast enough. Linux maintains a large page cache in RAM. Any time you read a file Linux stores the contents in memory. Reading a file multiple times will usually hit the cache and not the disk, which is super fast. Even multi-GB files will comfortably fit into modern computers' RAM, particularly if you're doing the reads back-to-back while the cached pages are still fresh.
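As a small aside (not in the original answer): grep can count matching lines itself via its standard -c flag, which drops the wc stage entirely:

```shell
# grep -c prints the number of matching lines, so no wc is needed.
# Assumes the same test.txt sample file shown in the question.
first=$(grep -c 'first example' test.txt)
second=$(grep -c 'second example' test.txt)
echo "$first $second"    # with the sample file: 3 2
```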
One grep
You could improve this by using a single grep call that searches for both strings. It could work if you don't actually need the individual counts but just want the total:
total=$(grep -e 'first example' -e 'second example' test.txt | wc -l)
Or if there are very few lines that match, you could use it to filter down the large file into a small set of matching lines, and then use the original greps to pull out the separate counts:
matches=$(grep -e 'first example' -e 'second example' test.txt)
first=$(grep 'first example' <<< "$matches" | wc -l)
second=$(grep 'second example' <<< "$matches" | wc -l)
Pure bash
You could also build a Bash-only solution that does a single pass and invokes no external programs. Forking processes is slow, so using only built-in commands like read and [[ can offer a nice speedup.
First, let's start with a while read loop to process the file line by line:
while IFS= read -r line; do
...
done < test.txt
You can count matches by using double square brackets [[ and string equality ==, which accepts * wildcards:
first=0
second=0
while IFS= read -r line; do
[[ $line == *'first example'* ]] && (( first++ ))
[[ $line == *'second example'* ]] && (( second++ ))
done < test.txt
echo "$first" ## should display 3
echo "$second" ## should display 2
Another language
If none of these are fast enough then you should consider using a "real" programming language like Python, Perl, or, really, whatever you are comfortable with. Bash is not a speed demon. I love it, and it's really underappreciated, but even I'll admit that high-performance data munging is not its wheelhouse.
CodePudding user response:
If you're going to be doing things like this, I'd really recommend getting familiar with awk; it's not scary, and IMO it's much easier to do complex things like this with it vs. the weird pipefitting you're looking at. Here's a simple awk program that'll count occurrences of both patterns at once:
awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt
Explanation: /first example/ {first++} means for each line that matches the regex pattern "first example", increment the first variable. /second example/ {second++} does the same for the second pattern. Then END {print first, second} means at the end, it should print the two variables. Simple.
But there is one tricky thing: splitting the two numbers it prints into two different variables. You could do this with read:
bothcounts=$(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
read first second <<<"$bothcounts"
(Note: I recommend using lower- or mixed-case variable names, to avoid conflicts with the many all-caps names that have special functions.)
Another option is to skip the bothcounts variable by using process substitution to feed the output from awk directly into read:
read first second < <(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
CodePudding user response:
">" is about redirecting to a file or device, not to the next command in a pipe. So tee just lets you redirect the pipe to multiple files, not to multiple commands. So just try this:
FIRST=$(grep 'first example' test.txt| wc --lines)
SECOND=$(grep 'second example' test.txt| wc --lines)
CodePudding user response:
It's possible to get the matches and count them in a single pass, then pull the count of each from the result.
matches="$(grep -e 'first example' -e 'second example' --only-matching test.txt | sort | uniq -c | tr -s ' ')"
FIRST=$(grep -e 'first example' <<<"$matches" | cut -d ' ' -f 2)
echo $FIRST
Result:
3
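The second count can be pulled from the same $matches variable the same way, keeping the file scan to a single pass:

```shell
# Reuses the $matches variable built by the single grep pass above.
SECOND=$(grep -e 'second example' <<<"$matches" | cut -d ' ' -f 2)
echo $SECOND    # prints 2 for the sample file
```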
Using awk is the best option, I think.