Using sed to extract the middle of a string that may contain one or more digits

My strings are:

"TESTING_ABC_1-JAN-2022.BCK-gz;1"
"TESTING_ABC_30-JAN-2022.BCK-gz;1"

In bash when I run: echo "TESTING_ABC_1-JAN-2022.BCK-gz;1" | sed 's/.*\([0-9]\{1,2\}-[A-Z][A-Z][A-Z]-[0-9][0-9][0-9][0-9]\).*/\1/' it returns 1-JAN-2022 which is good.

But when I run: echo "TESTING_ABC_30-JAN-2022.BCK-gz;1" | sed 's/.*\([0-9]\{1,2\}-[A-Z][A-Z][A-Z]-[0-9][0-9][0-9][0-9]\).*/\1/' I get 0-JAN-2022 but I want 30-JAN-2022.

How can I extract the date with a single command, whether the day has one digit or two, so that I get "30-JAN-2022" or "1-JAN-2022" as appropriate?

CodePudding user response：

Using sed

$ echo "TESTING_ABC_1-JAN-2022.BCK-gz;1
> TESTING_ABC_30-JAN-2022.BCK-gz;1" | sed -E 's/[^0-9]*([^.]*).*/\1/'
1-JAN-2022
30-JAN-2022

CodePudding user response：

It is much easier to use awk and split on the delimiters instead of writing a full date regex:

cat file

TESTING_ABC_1-JAN-2022.BCK-gz;1
TESTING_ABC_30-JAN-2022.BCK-gz;1

awk -F '[_.]' '{print $3}' file

1-JAN-2022
30-JAN-2022

Another option is to use grep -Eo with a regex that matches a date in D-MON-YYYY or DD-MON-YYYY format:

grep -Eo '[0-9]{1,2}-[A-Z]{3}-[0-9]{4}' file

1-JAN-2022
30-JAN-2022
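
The field-splitting idea behind the awk command can also be sketched with plain POSIX parameter expansion, without starting an external process. This is only a minimal sketch: it assumes the date always sits between the last _ and the first . of the string, and extract_date is a hypothetical helper name that does not come from the answer above.

```shell
# extract_date is a hypothetical helper name used only for illustration.
# Assumption: the date sits between the last "_" and the first "." of the input.
extract_date() {
    d=${1##*_}    # drop everything up to and including the last underscore
    d=${d%%.*}    # drop everything from the first dot onward
    printf '%s\n' "$d"
}

extract_date 'TESTING_ABC_1-JAN-2022.BCK-gz;1'
extract_date 'TESTING_ABC_30-JAN-2022.BCK-gz;1'
```

Unlike the grep pattern, this does not check that the result looks like a date, so it only suits input whose layout is known in advance.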

CodePudding user response：

The problem with your regex is the greedy * quantifier: .* matches as many characters as possible while still allowing the rest of the expression to match. Because [0-9]\{1,2\} is satisfied by a single digit, .* also consumes the 3 of 30, which leaves 0-JAN-2022. In many regex implementations you can switch the greediness of * by appending ?, so /.*?a/ matches as few characters as possible before the first a. Unfortunately, sed doesn't support non-greedy quantifiers. Here are two options:

If the date is always preceded by _, you can simply add _ after the .* part:

$ sed -r 's/.*_([0-9]{1,2}-[A-Z]{3}-[0-9]{4}).*/\1/' <<< "TESTING_ABC_30-JAN-2022.BCK-gz;1"
30-JAN-2022

Or just grep the relevant part (the -P option requires GNU grep):

$ grep -Po '([0-9]{1,2}-[A-Z]{3}-[0-9]{4})' <<< "TESTING_ABC_30-JAN-2022.BCK-gz;1"
30-JAN-2022
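
To see the greedy behaviour side by side with a fix, here is a small sketch. It uses a variation on the first option: instead of a literal _, it requires any non-digit immediately before the date. It assumes the date is never at the very start of the string, since at least one character must precede it. The variable names s, greedy and fixed are made up for this example.

```shell
s='TESTING_ABC_30-JAN-2022.BCK-gz;1'

# Greedy .* also takes the "3", because [0-9]{1,2} is satisfied by one digit.
greedy=$(printf '%s\n' "$s" | sed -E 's/.*([0-9]{1,2}-[A-Z]{3}-[0-9]{4}).*/\1/')

# Requiring a non-digit right before the date stops .* from taking the "3".
# Assumption: at least one non-digit character precedes the date.
fixed=$(printf '%s\n' "$s" | sed -E 's/.*[^0-9]([0-9]{1,2}-[A-Z]{3}-[0-9]{4}).*/\1/')

printf '%s\n%s\n' "$greedy" "$fixed"
```

The first line printed is the truncated 0-JAN-2022 from the question, and the second is the intended 30-JAN-2022.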