Capture a string or part of a string up until a certain character

Time:06-04

I have the following text:

    https://stackoverflow.com | https://google.com | first text to match | 
    https://randomsite.com | https://randomurl2.com | text | https://randomsite.com | 
    https://randomsite.com | https://randomsite.com |

I'm trying to match from the start of the text through the first field that is not a URL, up to and including the | that follows it. In this example I would like the regex to match:

    https://stackoverflow.com | https://google.com | first text to match |

Currently I have this:

    /^(.*)[|]\s(\b\w*\b)?\s[|]/gm

However, this only works if the first field that is not a URL is a single word without spaces. If first text to match were just first, it would match.

The desired result is a match in both cases: text without spaces and text with spaces.

EDIT: Sometimes I would also need a greedy match, where the regex would match everything up until text |.

CodePudding user response:

If the match must start with at least one leading URL:

    \A[\s\S]*?\b\K(?:https?://\S*\h*\|\h*)+[^\s|][^|\r\n]*\|

Explanation

  • \A Start of string
  • [\s\S]*? Match any character, including newlines, as few times as possible
  • \b\K A word boundary, then forget what has been matched so far
  • (?:https?://\S*\h*\|\h*)+ Match one or more URLs, each followed by | with optional horizontal whitespace on either side
  • [^\s|] Match a single non-whitespace character other than a pipe
  • [^|\r\n]*\| Match zero or more characters other than a pipe or a newline, then the closing pipe
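
For readers outside PCRE, the breakdown above can be sketched in Python. This is a port under assumptions, not the answer's exact pattern: Python's re module supports neither \K nor \h, so the \A[\s\S]*?\b\K prefix is dropped (re.search already returns the leftmost match) and \h is written as [ \t]. The URL group is read with a + quantifier, in line with the "one or more urls" bullet. The names SAMPLE and PATTERN are illustrative.

```python
import re

# Sample text from the question; the trailing space after the last pipe
# on the first two lines is kept inside the string literals.
SAMPLE = "\n".join([
    "    https://stackoverflow.com | https://google.com | first text to match | ",
    "    https://randomsite.com | https://randomurl2.com | text | https://randomsite.com | ",
    "    https://randomsite.com | https://randomsite.com |",
])

# Assumed Python port of the PCRE pattern:
#   (?:https?://\S*[ \t]*\|[ \t]*)+   one or more URLs, each followed by a pipe
#   [^\s|]                            first character of the non-URL text
#   [^|\r\n]*\|                       rest of that text up to the closing pipe
PATTERN = re.compile(r"(?:https?://\S*[ \t]*\|[ \t]*)+[^\s|][^|\r\n]*\|")

match = PATTERN.search(SAMPLE)
first_match = match.group(0) if match else None
print(first_match)
```

Because the pattern has to begin at a URL, the leading indentation is left out of the match, which is what \K achieves in the PCRE version.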

If text with no leading URLs should also match:

    \A[\s\S]*?\b\K(?:https?://\S*\h*\|\h*)*[^\s|][^|\r\n]*\|
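
A minimal sketch of this variant in Python, under the assumption that dropping the \A[\s\S]*?\b\K prefix and writing \h as [ \t] is an acceptable port (Python's re module has neither \K nor \h). The helper first_field and the a.example URL are made up for illustration.

```python
import re

# Assumed port: the URL group is optional (*), so text with no URL in
# front of it can match as well. [ \t] stands in for PCRE's \h.
OPTIONAL_URLS = re.compile(r"(?:https?://\S*[ \t]*\|[ \t]*)*[^\s|][^|\r\n]*\|")


def first_field(text):
    """Return the leftmost match, or None when nothing matches."""
    m = OPTIONAL_URLS.search(text)
    return m.group(0) if m else None


with_urls = first_field("https://google.com | first text to match | https://a.example |")
without_urls = first_field("first text to match | https://a.example |")
print(with_urls)
print(without_urls)
```

The first call keeps the leading URL in the match; the second matches the text alone because the group is allowed to repeat zero times.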

Example

    // PCRE pattern: skip ahead lazily, reset the match start with \K,
    // then take one or more URLs and the first non-URL field up to its pipe.
    $re = '~\A[\s\S]*?\b\K(?:https?://\S*\h*\|\h*)+[^\s|][^|\r\n]*\|~';
    $str = '    https://stackoverflow.com | https://google.com | first text to match | 
        https://randomsite.com | https://randomurl2.com | text | https://randomsite.com | 
        https://randomsite.com | https://randomsite.com |';

    if (preg_match($re, $str, $matches)) {
        echo $matches[0];
    }

Output

    https://stackoverflow.com | https://google.com | first text to match |

CodePudding user response:

To allow spaces in the text, let the group match whitespace as well as word characters:

    /^(.*)[|]\s(\b(\w|\s)*\b)?\s[|]/gm
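
A quick way to see what this pattern captures is to run it over the question's sample. The sketch below assumes Python, where the /gm flags correspond roughly to re.MULTILINE combined with finditer; the names SAMPLE and WITH_SPACES are illustrative, and group 2 holds the non-URL text.

```python
import re

# Sample text from the question; trailing spaces are kept inside the literals.
SAMPLE = "\n".join([
    "    https://stackoverflow.com | https://google.com | first text to match | ",
    "    https://randomsite.com | https://randomurl2.com | text | https://randomsite.com | ",
    "    https://randomsite.com | https://randomsite.com |",
])

#   ^(.*)[|]\s        greedy prefix up to a pipe followed by whitespace
#   (\b(\w|\s)*\b)?   words separated by whitespace (group 2)
#   \s[|]             whitespace and the closing pipe
WITH_SPACES = re.compile(r"^(.*)[|]\s(\b(\w|\s)*\b)?\s[|]", re.MULTILINE)

captured = [m.group(2) for m in WITH_SPACES.finditer(SAMPLE)]
print(captured)
```

Only word characters and whitespace are allowed in the group, so text containing punctuation would not be captured by this version.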

If you want to allow all sorts of special characters in the text (including newlines), you can try this approach:

    \|\s*((?!\s*\w+:\/\/)[^|]+?)\s\|

https://regex101.com/r/2OOKky/1
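
The lookahead idea can be sketched in Python as follows. This reads the pattern with a + quantifier after \w and after [^|], which is an assumption about the intended pattern; the forward slashes are left unescaped because Python does not need them escaped. The names SAMPLE and NOT_A_URL are illustrative.

```python
import re

# Sample text from the question; trailing spaces are kept inside the literals.
SAMPLE = "\n".join([
    "    https://stackoverflow.com | https://google.com | first text to match | ",
    "    https://randomsite.com | https://randomurl2.com | text | https://randomsite.com | ",
    "    https://randomsite.com | https://randomsite.com |",
])

# Assumed reading of the pattern:
#   \|\s*            opening pipe and any whitespace (newlines included)
#   (?!\s*\w+://)    the field must not start like scheme://
#   [^|]+?           the text itself, lazily (captured in group 1)
#   \s\|             one whitespace character and the closing pipe
NOT_A_URL = re.compile(r"\|\s*((?!\s*\w+://)[^|]+?)\s\|")

# findall returns the contents of group 1 for every match.
fields = NOT_A_URL.findall(SAMPLE)
print(fields)
```

Each match consumes its closing pipe, so two non-URL fields that sit directly next to each other would yield only the first of them.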

If you want to allow all sorts of special characters in the text (but no newlines), you can try this approach:

    (?:^|\|)(?:(?!$)\s)+((?!\s*\w+:\/\/)(?:(?!$)[^|])+?)(?:(?!$)\s)*\|

https://regex101.com/r/HS3bra/1