How to extract substring between two other substrings?

Time:12-27

I have a script that reads a log file line by line. I need to extract the text between two substrings, if they exist in the line my script is currently reading.

For instance, if a line has:

some random text here substring A abc/def/ghi substring B

I need to extract the text abc/def/ghi that is between substring A and substring B by storing it in a variable. How would I go about doing this?

I looked through "Extract substring in Bash" but couldn't find anything that exactly matches my use case.

CodePudding user response:

Bash provides parameter expansion with substring removal, which lets you trim everything through "substring A" from the front and then trim "substring B" (and anything after it) from the back, leaving "abc/def/ghi". For example, you can do:

ssa="substring A"         ## substrings to find text between
ssb="substring B"

line="some random text here substring A abc/def/ghi substring B"

text="${line#*"${ssa} "}"   ## trim through "$ssa " from the front (left)
text="${text%" ${ssb}"*}"   ## trim " $ssb" and what follows from the back (right)

printf '%s\n' "$text"       ## output result

Example Output

abc/def/ghi

The two basic forms for trimming from the front of a string and the two for trimming from the back are:

${var#pattern}      # Strip shortest match of pattern from front of $var
${var##pattern}     # Strip longest match of pattern from front of $var
${var%pattern}      # Strip shortest match of pattern from back of $var
${var%%pattern}     # Strip longest match of pattern from back of $var

Where pattern can contain globbing characters such as '*' and '?'. Look things over and let me know if you have any further questions.
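To see how the shortest and longest forms differ, here is a small sketch that applies all four to one string. The value of path is a made-up sample, not something from the question.

```shell
#!/usr/bin/env bash
path="logs/app/2023/app.log.gz"   # hypothetical sample value

front_short="${path#*/}"     # shortest "*/" from the front -> app/2023/app.log.gz
front_long="${path##*/}"     # longest  "*/" from the front -> app.log.gz
back_short="${path%.*}"      # shortest ".*" from the back  -> logs/app/2023/app.log
back_long="${path%%.*}"      # longest  ".*" from the back  -> logs/app/2023/app

printf '%s\n' "$front_short" "$front_long" "$back_short" "$back_long"
```

The single-character forms stop at the nearest match, while the doubled forms keep going to the farthest one.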

Using BASH_REMATCH

BASH_REMATCH is an internal array that holds the results of a [[ text =~ REGEX ]] match. ${BASH_REMATCH[0]} is the entire text matched by REGEX, and ${BASH_REMATCH[1]}, ${BASH_REMATCH[2]}, and so on hold the text matched by each parenthesized (...) capture group in the regular expression (you can have several).
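As a quick illustration of how the array is filled, the sketch below matches a made-up log entry against a pattern with three capture groups. The entry text and the variable names are assumptions for the demo only.

```shell
#!/usr/bin/env bash
entry="2023-12-27 ERROR disk full"        # hypothetical sample line
pattern='^([0-9-]+) ([A-Z]+) (.*)$'       # three capture groups

if [[ $entry =~ $pattern ]]; then
    whole="${BASH_REMATCH[0]}"            # the entire matched text
    date_part="${BASH_REMATCH[1]}"        # first  (...) group
    level="${BASH_REMATCH[2]}"            # second (...) group
    message="${BASH_REMATCH[3]}"          # third  (...) group
    printf '%s\n' "$date_part" "$level" "$message"
fi
```

Keeping the pattern in a variable and leaving it unquoted on the right of =~ makes bash treat it as a regular expression rather than a literal string.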

Using the same ssa, ssb, and line variables as in the parameter expansion example, you could replace the two parameter expansions with:

regex="^.*${ssa} ([^ ]+) ${ssb}.*$"   ## REGEX to match with (..) capture

[[ $line =~ $regex ]] && echo "${BASH_REMATCH[1]}"

Where the regular expression in $regex will match the entire line capturing what is between $ssa and $ssb. The complete modified script would be:

ssa="substring A"         ## substrings to find text between
ssb="substring B"

line="some random text here substring A abc/def/ghi substring B"

regex="^.*${ssa} ([^ ]+) ${ssb}.*$"   ## REGEX to match with (..) capture

[[ $line =~ $regex ]] && echo "${BASH_REMATCH[1]}"

(same output)

Both methods are fully explained in man 1 bash. Use whichever fits the circumstance you are faced with. I have always found parameter expansion a bit more intuitive (and you can incrementally whittle text down to just about anything you need). However, extended regular expression matching is a powerful alternative to the parameter expansions.
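Since the question reads a log file line by line and only wants a result when both markers are present, here is a minimal sketch that wraps the parameter expansion approach in a function with that check. The function name extract_between and the file name app.log are hypothetical. Note one deliberate difference: it stops at the first end marker after the start marker (%%), whereas the single % used above trims at the last one.

```shell
#!/usr/bin/env bash

# extract_between LINE START END   (hypothetical helper name)
# Prints the text between the first START and the next END.
# Returns 1 without printing if the markers are not both present in order.
extract_between() {
    local line=$1 start=$2 end=$3 rest
    [[ $line == *"$start"*"$end"* ]] || return 1
    rest="${line#*"$start"}"              # drop everything through START
    printf '%s\n' "${rest%%"$end"*}"      # drop END and everything after it
}

logfile="app.log"                         # assumed log file name
if [[ -r $logfile ]]; then
    while IFS= read -r line; do
        # The surrounding spaces are part of the markers so they are not
        # left in the extracted text.
        if text=$(extract_between "$line" "substring A " " substring B"); then
            printf '%s\n' "$text"
        fi
    done < "$logfile"
fi
```

Lines that lack either marker are skipped silently, because the function returns a non-zero status for them.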

CodePudding user response:

I believe you can do this:

var="$(echo "some random text here substring A abc/def/ghi substring B" | grep -oP "substring A \K(.*)(?=\ substring B)")"

# which produces:
echo "$var"
abc/def/ghi

Or, if you find a lookbehind more readable, you can use this instead:

grep -oP "(?<=substring\ A\ )(.*)(?=\ substring B)"

This is essentially the same logic as the above one.

This will also work if the searched/matched string is 2 or more words.


Edit 1:

So now I understand that you are trying to do this by extracting the last line of a file and then doing the regex matching? You can do:

var="$(tail -n1 file.txt|grep -oP "(?<=substring\ A\ )(.*)(?=\ substring B)")"

This works as long as you are sure the file's last line always matches the pattern in your original question.
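If you are not sure the last line matches, you can test grep's exit status instead of ending up with an empty variable. This is only a sketch: the function name last_match is hypothetical, and -P assumes GNU grep built with PCRE support.

```shell
#!/usr/bin/env bash

# last_match FILE   (hypothetical helper name)
# Prints the text between the markers on the last line of FILE.
# The pipeline's exit status is grep's: 0 on a match, 1 otherwise.
last_match() {
    tail -n1 "$1" | grep -oP '(?<=substring A ).*(?= substring B)'
}

if var=$(last_match file.txt 2>/dev/null); then
    printf 'found: %s\n' "$var"
else
    printf 'no match on the last line\n' >&2
fi
```

The assignment inside the if condition passes along the exit status of the command substitution, so the else branch runs whenever grep finds nothing.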
