How to get text inside html using regex?

Time:11-09

How can I collect, into separate arrays, all the words on the left side of the dash and all the words on the right side of the dash? I only need the words inside <p> elements. So far I have only been able to match the last line. https://regex101.com/r/QpeiYK/1

CodePudding user response:

Regular expressions are not the right tool to use here. Since you are dealing with HTML, you want to use an HTML parser.

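$parser = new DOMDocument;
$parser->loadHTML($htmlString);

$xpath = new DOMXPath($parser);
// Select every <p> whose class attribute contains "example"
$nodes = $xpath->query("//p[contains(@class, 'example')]");
$lines = [];
foreach ($nodes as $node) {
    // Append rather than overwrite, so every paragraph's text is kept
    $lines[] = $node->textContent;
}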


CodePudding user response:

People tend to assume that Regex is some magical superpower that can do anything from parsing HTML to curing cancer (not really), but it's just a tool like any other. And in this case, it's the wrong tool.

To solve this problem, break it down into smaller problems.

Step 1: find the text

Use a DOM Parser. It's much safer and more reliable than trying to regex your way through.

<?php
$dom = new DOMDocument();
$dom->loadHTML("Your HTML string here");
$xpath = new DOMXPath($dom);
$divs = $xpath->query("//div[contains(@class,'example')]");
// contains() is a plain substring test, so it would also match a class
// such as "example-wide". Double-check the element really has the class:
$hasClass = function($elem, $cls) {
  $classes = explode(" ", $elem->getAttribute("class") ?? "");
  foreach( $classes as $c) {
    if( $c === $cls) return true;
  }
  return false;
};

$matches = [];
foreach( $divs as $div) {
  if( !$hasClass($div, "example")) continue;
  $matches[] = trim($div->textContent);
}
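
As an alternative to the separate $hasClass check, the exact-class test can be folded into the XPath query itself. The sketch below uses a common XPath 1.0 idiom: pad the normalized class attribute with spaces and search for the class name surrounded by spaces. The sample markup and the class names "example-wide" and "note" are made up for illustration; they are not from the asker's HTML.

```php
<?php
// Hypothetical sample markup; the asker's real HTML is not shown in the thread.
$html = '<div class="example">first</div>'
      . '<div class="example-wide">second</div>'
      . '<div class="note example">third</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the normalized class list with spaces, then look for " example "
// so that only a whole class token matches, not a substring of one.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' example ')]";

$matches = [];
foreach ($xpath->query($query) as $div) {
    $matches[] = trim($div->textContent);
}
```

With this query, the "example-wide" element is skipped without any post-filtering, at the cost of a less readable expression. Either approach should work; which one reads better is a matter of taste.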

Step 2: process the text

Now that you have your blocks of text, you can process them accordingly. In this case, you have lines where "left words" and "right words" are separated by a dash ("—").

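$pairs = [];
foreach( $matches as $match) {
  $lines = explode("\n", $match);
  foreach( $lines as $line) {
    // Skip blank lines and lines without a dash, which would otherwise
    // leave $right undefined
    if( strpos($line, "—") === false) continue;
    // A limit of 2 keeps any further dashes inside the right-hand part
    list($left, $right) = explode("—", $line, 2);
    $pairs[] = [trim($left), trim($right)];
  }
}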

And now you have an array of matching pairs. No regex needed!
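
The question asked for two separate arrays rather than pairs. One way to get there, sketched here with made-up sample data in the same shape as $pairs, is array_column, which pulls a single column out of an array of rows. The names $leftWords and $rightWords are illustrative only.

```php
<?php
// Hypothetical pairs, in the shape that the loop above produces
$pairs = [
    ["apple", "red"],
    ["banana", "yellow"],
];

// Column 0 holds the left-hand words, column 1 the right-hand words
$leftWords  = array_column($pairs, 0);
$rightWords = array_column($pairs, 1);
```

Keeping the pairs as the primary structure and deriving the two arrays afterwards means the left and right words stay aligned by index.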
