Extract capture group if it exists, otherwise return the original string

Given a String, I'd like to use a regex to:

if the given String does NOT match regex, return the ENTIRE String
if the given String does match regex, then return ONLY the capture group

Let's say I have the following regex:

hello\s*([a-z]+)

Here are inputs and the return I am looking for:

"well hello" --> "well hello" (regex did not match)
"well hello world extra words" --> "world"
"well hello   world!!!" --> "world"
"well hello \n \n world\n\n\n" --> "world" (should ignore all newlines)
"this string doesn't match at all" --> "this string doesn't match at all"

Limitations: I can only use grep, sed, and awk; egrep and gawk are not available.

> print "world hello something else\n" | sed -rn "s/hello ([a-z]+)/\1/p"
world something else

This is the closest I've gotten. A few problems:

it returns other parts of the string
I couldn't get \s* to match, but a literal space works
I'm not sure, but the /p at the end of the sed command seems to print a newline

CodePudding user response：

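Use an alternation:

.*hello\s*([a-z]+).*|(.*)

Then substitute groups 1 and 2:

sed -rn "s/.*hello[[:space:]]*([a-z]+).*|(.*)/\1\2/p"

sed replaces only the text the regex matched, so the first alternative is wrapped in .* to consume the rest of the line and leave only group 1. If the first alternative doesn't match, the second one captures the whole input. Either way, one of group 1 or group 2 is empty. [[:space:]] is the POSIX character class to use where \s is not understood.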
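
If the alternation proves awkward, a simpler variant is possible. This is only a sketch, not part of the answer above: it relies on sed printing unmatched lines unchanged when -n is omitted, so no second alternative is needed. The function name extract is made up for illustration, and -E assumes a sed with extended regex support (GNU and BSD sed both have it).

```shell
# Hypothetical helper: print the capture group if the line matches,
# otherwise print the line unchanged.
extract() {
  # Without -n, sed prints every line; s/// only rewrites lines that match.
  # The leading and trailing .* swallow everything outside the group.
  printf '%s\n' "$1" | sed -E 's/.*hello[[:space:]]*([a-z]+).*/\1/'
}

extract 'well hello world extra words'      # matched: prints the captured word
extract "this string doesn't match at all"  # no match: prints the input as-is
```

Because the leading .* is greedy, the last hello on the line wins if there are several. This version works line by line, so it does not cover input that contains real newlines.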

CodePudding user response：

This might work for you (GNU sed):

sed -E 's/\\n/\n/g;/^well hello\s*([a-z]+).*/s//\1/;s/\n/\\n/g' file

Turn each literal \n into a real newline.

Match lines that begin with well hello, followed by zero or more whitespace characters, followed by one or more characters a through z, followed by anything else. If the line matches, replace it with the captured a-z characters; otherwise leave the original string untouched.

Finally, turn any remaining newlines back into literal \n.
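
The command above treats \n as two literal characters in the file. If the input instead contains real newline characters, one option is to read the whole input into the pattern space before substituting. The following is a sketch under that assumption: extract_multiline is a hypothetical name, and the pattern relies on . and [[:space:]] matching a newline inside the pattern space, which the GNU sed manual documents; other sed implementations may differ.

```shell
# Hypothetical helper for input that may span several lines.
extract_multiline() {
  # ':a', '$!{N;ba' and '}' form a loop that appends every remaining line
  # to the pattern space, so the substitution sees the whole input at once.
  printf '%s\n' "$1" |
    sed -E -e ':a' -e '$!{N;ba' -e '}' \
        -e 's/.*hello[[:space:]]*([a-z]+).*/\1/'
}

# Command substitution drops the trailing newlines; printf adds one back.
input=$(printf 'well hello \n \n world\n\n\n')
extract_multiline "$input"
```

The loop is guarded with $! rather than written as a bare N, because GNU sed exits at a bare N on the last line, which would skip the substitution for single-line input.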