I am new to regex. I am parsing an HTML page, and because it is buggy I cannot use an XML or HTML parser, so I am using a regular expression. My code looks like this:
$html = '<html><div data-id="ABC012" data-index="123" ...';
preg_match_all('/<div data-id="[A-Z\\d]+" data-index="\\d+"/', $html, $result);
var_dump($result);
The output looks good so the code is working. Now I want to extract the matched values. I did it exactly as described in this answer and now the code looks like this:
$html = '<html><div data-id="ABC012" data-index="123" ...';
preg_match_all('/<div data-id="#([A-Z\\d]+)" data-index="#(\\d+)"/', $html, $result);
var_dump($result);
But it outputs an empty array. What is wrong? Please don't improve the pattern by adding the closing '>' or making it robust against whitespace. I just need to get the code running.
Answer:
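You could write the code and the pattern like this, using a single backslash to match digits (\d) and omitting the # from the pattern, since it does not occur in the example data. With / as the delimiter, # is matched as a literal character, which is why the pattern found nothing. The single backslash is only cosmetic: in a single-quoted PHP string, \\d and \d produce the same pattern.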
$html = '<html><div data-id="ABC012" data-index="123" ...';
preg_match_all('/<div data-id="([A-Z\d]+)" data-index="(\d+)"/', $html, $result);
var_dump($result);
Output
array(3) {
[0]=>
array(1) {
[0]=>
string(38) "<div data-id="ABC012" data-index="123""
}
[1]=>
array(1) {
[0]=>
string(6) "ABC012"
}
[2]=>
array(1) {
[0]=>
string(3) "123"
}
}
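To see the effect of the # in isolation, here is a minimal sketch that runs both patterns against the sample input from the question and compares the match counts. The variable names ($withHash, $withoutHash, $m1, $m2) are made up for this illustration, and the PREG_SET_ORDER flag is an optional addition that is not part of the code above; it groups the captures per match, which may be more convenient when looping.

```php
<?php
// Sample input from the question (it is truncated there with "...").
$html = '<html><div data-id="ABC012" data-index="123" ...';

// With "/" as the delimiter, "#" is an ordinary character that must
// appear in the input. It does not, so nothing matches.
$withHash = preg_match_all(
    '/<div data-id="#([A-Z\d]+)" data-index="#(\d+)"/',
    $html,
    $m1
);

// Same pattern without "#". PREG_SET_ORDER returns one array per match.
$withoutHash = preg_match_all(
    '/<div data-id="([A-Z\d]+)" data-index="(\d+)"/',
    $html,
    $m2,
    PREG_SET_ORDER
);

var_dump($withHash);    // int(0)
var_dump($withoutHash); // int(1)

foreach ($m2 as $match) {
    // $match[0] is the full match, $match[1] and $match[2] are the groups.
    echo $match[1], ' ', $match[2], PHP_EOL; // ABC012 123
}
```

preg_match_all returns the number of full matches, so checking its return value is a quick way to tell "pattern did not match" apart from "pattern matched but the groups are wrong".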