Saving values in BASH shell variables while using |tee

I am trying to count the number of matching lines in a very LARGE file and store the counts in variables using only BASH shell commands.

Currently, I am scanning the very large file twice, using a separate grep statement each time, like so:

$ cat test.txt 
first example line one
first example line two
first example line three
second example line one
second example line two
$ FIRST=$( cat test.txt | grep 'first example'  | wc --lines ; ) ;  ## first run
$ SECOND=$(cat test.txt | grep 'second example' | wc --lines ; ) ;  ## second run

and I end up with this:

$ echo $FIRST
3
$ echo $SECOND
2

Ideally, I want to scan the large file only once. Also, I have never used awk and would rather not use it!

The |tee option is new to me. It seems that teeing the results into two separate grep statements might mean the large file only has to be scanned once.

Ideally, I would also like to do this without creating any temporary files and subsequently having to remember to delete them.

I have tried multiple variations, like the ones below:

FIRST=''; SECOND='';
cat  test.txt                                                   \
    |tee  >(FIRST=$( grep 'first example'  | wc --lines ;);)    \
          >(SECOND=$(grep 'second example' | wc --lines ;);)    \
          >/dev/null        ;

and using read:

FIRST=''; SECOND='';
cat  test.txt                                                       \
   |tee  >(grep 'first example'   | wc --lines | (read FIRST);  );  \
         >(grep 'second example'  | wc --lines | (read SECOND); );  \
         > /dev/null                   ;



cat  test.txt                                                           \
      | tee  <( read FIRST  < <(grep 'first example'  | wc --lines ))   \
             <( read SECOND < <(grep 'second example' | wc --lines ))   \
             >    /dev/null             ;

and with curly brackets:

FIRST=''; SECOND='';
cat test.txt                                                     \
  |tee   >(FIRST={$( grep 'first example'  | wc --lines ;)} )    \
         >(SECOND={$(grep 'second example' | wc --lines ;)} )    \
         >/dev/null                           ;
   

but none of these allow me to save the line count into variables FIRST and SECOND.

Is this even possible to do?

CodePudding user response:

tee isn't saving any work. Each grep is still going to scan the full contents. Either way you've got three passes over the data: two greps and one Useless Use of Cat. In fact, tee just adds a fourth program that loops over the whole file.

The various | tee invocations you tried don't work because of one fatal flaw: variable assignments don't work in pipelines. That is to say, they "work" insofar as a variable is assigned a value; it's just that the value is almost immediately lost. Why? Because the assignment happens in a subshell, not in the parent shell.

Every command in a | pipeline executes in a different process and it's a fundamental fact of Linux systems that processes are isolated from each other and don't share variable assignments.

As a rule of thumb, you can write variable=$(foo | bar | baz) where the variable is on the outside. No problem. But don't try foo | variable=$(bar) | baz where it's on the inside. It won't work and you'll be sad.
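
To see the pitfall in isolation (no tee involved, just a plain pipeline):

count=0
printf 'a\nb\nc\n' | count=$(wc -l)   ## assigns 3, but only inside a subshell
echo "$count"                         ## still prints 0

count=$(printf 'a\nb\nc\n' | wc -l)   ## variable on the outside
echo "$count"                         ## prints 3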

But don't lose hope! There are plenty of ways to skin this cat. Let's go through a few of them.

Two greps

Getting rid of cat yields:

first=$(grep 'first example' test.txt | wc -l)
second=$(grep 'second example' test.txt | wc -l)

This is actually pretty good and will usually be fast enough. Linux maintains a large page cache in RAM: any time you read a file, Linux keeps the contents in memory, so reading the same file multiple times will usually hit the cache rather than the disk, which is super fast. Even multi-GB files will comfortably fit in a modern computer's RAM, particularly if you do the reads back-to-back while the cached pages are still fresh.
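
If you want to see the cache at work, time the same search twice (bigfile.txt here is a stand-in for your actual large file):

time grep -c 'first example' bigfile.txt   ## first read may come from disk
time grep -c 'first example' bigfile.txt   ## repeat read is usually served from RAM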

One grep

You could improve on this by using a single grep call that searches for both strings. That works if you don't actually need the individual counts but just want the total:

total=$(grep -e 'first example' -e 'second example' test.txt | wc -l)

Or if there are very few lines that match, you could use it to filter down the large file into a small set of matching lines, and then use the original greps to pull out the separate counts:

matches=$(grep -e 'first example' -e 'second example' test.txt)
first=$(grep 'first example' <<< "$matches" | wc -l)
second=$(grep 'second example' <<< "$matches" | wc -l)

Pure bash

You could also build a Bash-only solution that does a single pass and invokes no external programs. Forking processes is slow, so using only built-in commands like read and [[ can offer a nice speedup.

First, let's start with a while read loop to process the file line by line:

while IFS= read -r line; do
   ...
done < test.txt

You can count matches by using double square brackets [[ and string equality ==, which accepts * wildcards:

first=0
second=0

while IFS= read -r line; do
    [[ $line == *'first example'* ]] && ((++first))
    [[ $line == *'second example'* ]] && ((++second))
done < test.txt

echo "$first"   ## should display 3
echo "$second"  ## should display 2

Another language

If none of these are fast enough then you should consider using a "real" programming language like Python, Perl, or, really, whatever you are comfortable with. Bash is not a speed demon. I love it, and it's really underappreciated, but even I'll admit that high-performance data munging is not its wheelhouse.

CodePudding user response:

If you're going to be doing things like this, I'd really recommend getting familiar with awk; it's not scary, and IMO it's much easier to do complex things like this with it vs. the weird pipefitting you're looking at. Here's a simple awk program that'll count occurrences of both patterns at once:

awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt

Explanation: /first example/ {first++} means for each line that matches the regex pattern "first example", increment the first variable. /second example/ {second++} does the same for the second pattern. Then END {print first, second} means that at the end, it should print the two variables. Simple.
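
For the sample file from the question, running it prints both counts on one line:

$ awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt
3 2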

But there is one tricky thing: splitting the two numbers it prints into two different variables. You could do this with read:

bothcounts=$(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
read first second <<<"$bothcounts"

(Note: I recommend using lower- or mixed-case variable names, to avoid conflicts with the many all-caps names that have special functions.)

Another option is to skip the bothcounts variable by using process substitution to feed the output from awk directly into read:

read first second < <(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)

CodePudding user response:

">" is about redirect to file/device, not to the next command in pipe. So tee will just allow you to redirect pipe to multiple files, not to multiple commands. So just try this:

FIRST=$(grep 'first example' test.txt | wc --lines)
SECOND=$(grep 'second example' test.txt | wc --lines)
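
For completeness, tee's fan-out to files would look something like this, though it creates exactly the kind of temporary files the question wanted to avoid (the copy names here are made up):

cat test.txt | tee first_copy.txt second_copy.txt > /dev/null
FIRST=$(grep 'first example' first_copy.txt | wc --lines)
SECOND=$(grep 'second example' second_copy.txt | wc --lines)
rm first_copy.txt second_copy.txt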

CodePudding user response:

It's possible to get the matches and count them in a single pass, then pull the count for each pattern out of the result.

matches="$(grep -e 'first example' -e 'second example' --only-matching test.txt | sort | uniq -c | tr -s ' ')"

FIRST=$(grep -e 'first example' <<<"$matches" | cut -d ' ' -f 2)
echo $FIRST

Result:

3
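
The second count can be pulled out of the same $matches variable in the same way:

SECOND=$(grep -e 'second example' <<<"$matches" | cut -d ' ' -f 2)
echo $SECOND   ## prints 2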

Using awk is the best option, I think.
