Regex (PHP) to extract a sentence that contains a link

Time:12-22

I want to retrieve the entire sentence that surrounds a link, delimited by punctuation (such as . or ! or ? or newline).

The purpose is to provide a better context for the link.

So, for example, if I have this...

$input = "I don't want this piece! This is the <a href='https://example.com/my-sentence'>sentence</a> I want. I don't want this piece either";
$filter = "https://example.com/my-sentence";

... I need to get this:

$output = "This is the sentence I want.";

So far, I managed to isolate a sentence that doesn't contain tags, like this:

$input = "I don't want this piece. This is the sentence I want. I don't want this piece either";
$filter = "sentence";
// Start at a capital letter and stop at the next "." or ";" on either side of the filter
$regex = '/[A-Z][^\\.;]*('.preg_quote($filter, '/').')[^\\.;]*/';
if (preg_match($regex, $input, $match)) {
  $output = $match[0];
}

This works just fine. However, I don't know how to get around the punctuation inside the URL.

I explored isolating the anchor first and regexing that, which works on any single example but may generate collisions in the wild (anchors duplicating other anchors or random text).

Another way to go seems to be strip_tags, something like...

$input = strip_tags($input);

... the problem being that I need them both stripped and not stripped at the same time.

Maybe a more specific regex or some smart wrapping of the functions could offer an easy way out of this, or maybe it's a dead end and some other approach is required. I don't know, but right now I'm stuck. Please help!

Answer:

Provided you do not care about abbreviations (the dot in "e.g." would end the sentence), you can match, zero or more times before and after a specific filter string, either a char other than ?, ! and ., or a link-like substring:

$input = "I don't want this piece! This is the <a href='https://example.com/my-sentence'>sentence</a> I want. I don't want this piece either";
$filter = "sentence";
// Each repeated unit is either one non-terminator char or a whole URL (note the + after the URL char class)
$regex = '~\b(?:[^.?!]|https?://[^<>\s"\']+)*?'.preg_quote($filter, '~').'(?:[^.?!]|https?://[^<>\s"\']+)*~u';
if (preg_match_all($regex, $input, $match)) {
  // Remove the markup from every matched sentence
  print_r( array_map('strip_tags', $match[0]) );
}

Output:

Array
(
    [0] => This is the sentence I want
)
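
The same idea can be wrapped in a function that takes the link URL as the filter, as in the question. What follows is only a sketch under a few assumptions: the name extract_link_sentence is hypothetical, a newline is treated as a sentence delimiter (as the question asks), and an optional closing ., ? or ! is kept in the match so the result ends with its punctuation.

```php
<?php
// Hypothetical helper (not from the original answer): returns the tag-free
// sentence that contains $filter, or null when nothing matches.
function extract_link_sentence(string $html, string $filter): ?string
{
    // One "unit" is either a single char that does not end a sentence
    // (assumption: newline also ends a sentence) or a whole URL.
    $unit = '(?:[^.?!\n]|https?://[^<>\s"\']+)';
    // Assumption: keep one optional terminator at the end of the match.
    $regex = '~\b' . $unit . '*?' . preg_quote($filter, '~') . $unit . '*[.?!]?~u';

    if (preg_match($regex, $html, $match)) {
        return trim(strip_tags($match[0]));
    }
    return null;
}

$input = "I don't want this piece! This is the <a href='https://example.com/my-sentence'>sentence</a> I want. I don't want this piece either";
echo extract_link_sentence($input, 'https://example.com/my-sentence'), PHP_EOL;
// prints: This is the sentence I want.
```

Because the match runs on the unstripped input and strip_tags is applied only to the result, the URL stays available for filtering while the returned sentence is plain text.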

Details:

  • \b - a word boundary
  • (?:[^.?!]|https?://[^<>\s"\']+)*? - zero or more occurrences, as few as possible, of either a char other than ., ? and !, or http, an optional s, :// and then one or more chars other than <, >, whitespace, " and '
  • sentence - a filter string
  • (?:[^.?!]|https?://[^<>\s"\']+)* - zero or more occurrences, as many as possible, of either a char other than ., ? and !, or http, an optional s, :// and then one or more chars other than <, >, whitespace, " and '
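
To see what the URL alternative contributes, it may help to compare the pattern with a simpler one that lacks it. This is an illustrative sketch only; the variable names $naive, $withUrl, $m1 and $m2 are assumptions, and the URL part uses the + quantifier, in line with the "one or more chars" description.

```php
<?php
// Illustrative comparison; variable names are assumptions, not from the answer.
$input = "I don't want this piece! This is the <a href='https://example.com/my-sentence'>sentence</a> I want. I don't want this piece either";
$filter = preg_quote('sentence', '~');

// Without the URL alternative, the dot in "example.com" counts as a sentence
// end, so the match can only begin after that dot.
$naive = '~\b[^.?!]*?' . $filter . '[^.?!]*~u';
preg_match($naive, $input, $m1);
echo $m1[0], PHP_EOL;
// prints: com/my-sentence'>sentence</a> I want

// With the URL alternative, the whole URL is consumed as a single unit,
// so its dots do not split the sentence.
$withUrl = '~\b(?:[^.?!]|https?://[^<>\s"\']+)*?' . $filter . '(?:[^.?!]|https?://[^<>\s"\']+)*~u';
preg_match($withUrl, $input, $m2);
echo $m2[0], PHP_EOL;
// prints: This is the <a href='https://example.com/my-sentence'>sentence</a> I want
```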