How to match some words after matching an entire sentence using regex?

Time:05-01

I'm a newbie. I'm trying to extract the full name from either of the lines below, without the "Obituary for" prefix:

<h2>Obituary for John Doe</h2>
<h1>James Michael Lee</h1>

My regex is this.

(<h1>(.+?)<\/h1>|<h2>Obituary\sfor\s(.+?)<\/h2>)

What I'm getting back is still "Obituary for John Doe". How can I remove the "Obituary for" part?

CodePudding user response:

Many roads lead to Rome; you could probably do something like this:

<h(?:1>|2>Obituary\sfor\s)\K[^><]+

See this demo at regex101. The matches will be in $out[0].

\K resets the beginning of the reported match, so everything matched before it is left out of the result. See the SO Regex FAQ for more.
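
To make the \K approach concrete, here is a minimal sketch in PHP. It assumes the pattern ends in the quantifier `[^><]+` and that the input is the two headings from the question; the variable names `$html`, `$re` and `$out` are only illustrative.

```php
<?php
// Sample input taken from the question (assumed to be the whole document here).
$html = "<h2>Obituary for John Doe</h2>\n<h1>James Michael Lee</h1>";

// Everything matched before \K is dropped from the reported match,
// so only the text after the tag (and after the optional prefix) remains.
$re = '/<h(?:1>|2>Obituary\sfor\s)\K[^><]+/';

// With the default flags, $out[0] holds the full matches, one per heading.
preg_match_all($re, $html, $out);

print_r($out[0]);
```

Because the pattern has no capture groups, there is nothing to pick out of `$out[1]` or `$out[2]`; the names are the matches themselves.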

CodePudding user response:

Could you do something like this, using a simple regex only to grab the heading contents and plain string functions to strip the prefix?

/**
 * @description : Function extracts names from html header tags
 * @example : "<h2>Obituary for John Doe</h2><h1>James Michael Lee</h1>" -> ["John Doe", "James Michael Lee"]
 * @param string $html
 * @return string[] : list of full names
 */
function extractFullNames($html) {
    // Capture whatever sits between an opening and closing <h1> or <h2> tag.
    $regex = '/<h[1-2]>(.*?)<\/h[1-2]>/';
    preg_match_all($regex, $html, $matches);
    $names = $matches[1];
    // Drop any nested markup first, then surrounding whitespace.
    $names = array_map('strip_tags', $names);
    $names = array_map('trim', $names);
    // The original capitalization is kept, so names like "McDonald" survive intact.
    $names = array_map('removeObituary', $names);
    return $names;
}

/**
 * @description : Function used to remove a leading "Obituary for" if present (case-insensitive)
 * @example : "Obituary for John Doe" -> "John Doe"
 * @param string $name
 * @return string : name without "Obituary for"
 */
function removeObituary($name) {
    $prefix = "Obituary for ";
    // Only strip the prefix when it appears at the very start of the string.
    if (stripos($name, $prefix) === 0) {
        $name = substr($name, strlen($prefix));
    }
    return $name;
}

// Test cases
$html = '<h2>Obituary for John Doe</h2><h1>James Michael Lee</h1>';
$names = extractFullNames($html);
$expected = ['John Doe', 'James Michael Lee'];

echo "Expected: " . implode(', ', $expected) . "\n";
echo "Actual: " . implode(', ', $names);

CodePudding user response:

I'd probably do something like

/^(?:\s*<[^>]*?>)?(?:.*\s+for\s+)?([^<]*)/

and extract $1 (the first capture group).
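
As a rough sketch of how this pattern could be applied in PHP: it is anchored with ^, so the example below runs it once per line. It assumes the quantifiers read `\s*` and `\s+`, and that each heading sits on its own line as in the question; `$lines`, `$re` and `$names` are illustrative names only.

```php
<?php
// One heading per line, as in the question (an assumption about the input).
$lines = [
    '<h2>Obituary for John Doe</h2>',
    '<h1>James Michael Lee</h1>',
];

// Optional opening tag, optional "... for " prefix, then capture up to the next "<".
$re = '/^(?:\s*<[^>]*?>)?(?:.*\s+for\s+)?([^<]*)/';

$names = [];
foreach ($lines as $line) {
    if (preg_match($re, $line, $m)) {
        // $m[1] is the first capture group, i.e. the name.
        $names[] = trim($m[1]);
    }
}

print_r($names);
```

Note that the prefix group is greedy and matches any text ending in " for ", so a heading whose name itself contains the word "for" would be cut at the last occurrence.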
