I'm trying to extract all CSS file URLs from websites. Here is my code:
<?php
$re = '/(?<=href=[\'\"])[^"]+\.css/mi';
$str = file_get_contents('test.html');
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
It works for this:
<link rel="stylesheet" type="text/css" href="http://example.com/mystyle.css">
and this one:
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>
but not for this:
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>
The output is:
array(1) {
[0]=>
array(1) {
[0]=>
string(185) "http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css"
}
}
CodePudding user response:
Your regex's issue is greediness. Change the + quantifier to +? so that it is lazy. It will then stop at the first .css rather than the last:
(?<=href=[\'"])[^"]+?\.css
- Quotes don't need to be escaped in the regex itself. The escaping you have done is only for the PHP string, so only the single quote needs to be escaped.
- The negated character class [^"] would also be incorrect when single quotes are used around the attribute value, because it only excludes the double quote.
- The m modifier is not needed because your regex is not using anchors.
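As a rough illustration of the greedy versus lazy difference, here is a minimal sketch. The file names a.css, b.css and c.css are made up so the three matches can be told apart, and the lazy pattern excludes both quote types, which is one possible way to address the character class note above.

```php
<?php
// Sample input: three <link> tags on separate lines, as in the question.
// The file names are hypothetical.
$html = "<link rel='stylesheet' type='text/css' href='http://example.com/a.css'>\n"
      . "<link rel='stylesheet' type='text/css' href='http://example.com/b.css'>\n"
      . "<link rel='stylesheet' type='text/css' href='http://example.com/c.css'>\n";

// Greedy: [^"]+ also consumes newlines, then backtracks to the LAST ".css".
$greedy = '/(?<=href=[\'"])[^"]+\.css/i';

// Lazy: +? stops at the FIRST ".css". Excluding both quote types keeps
// the match inside a single attribute value.
$lazy = '/(?<=href=[\'"])[^\'"]+?\.css/i';

preg_match_all($greedy, $html, $greedyMatches);
preg_match_all($lazy, $html, $lazyMatches);

echo count($greedyMatches[0]), "\n"; // one long match spanning all three tags
print_r($lazyMatches[0]);            // one entry per tag
```

The greedy pattern returns a single match that runs from the first URL to the last, which is the behaviour shown in the question's output.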
I think:
href=(["'])(.+?[.]css)\1
would be a bit better but that too has flaws. A parser would be a better way to achieve this.
Index 2 should contain your URLs. https://regex101.com/r/MfBmMq/1/
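A minimal sketch of how the backreference pattern could be used from PHP. The sample markup and the file names one.css and two.css are assumptions for illustration, and ~ is used as the delimiter only to avoid escaping slashes.

```php
<?php
// Hypothetical sample input with both quoting styles.
$html = <<<'HTML'
<link rel="stylesheet" type="text/css" href="http://example.com/one.css">
<link rel='stylesheet' type='text/css' href='http://example.com/two.css'>
HTML;

// Group 1 captures the opening quote, \1 requires the same quote to close,
// and group 2 is the URL ending in ".css".
$re = '~href=(["\'])(.+?[.]css)\1~i';

preg_match_all($re, $html, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER each match is its own array; index 2 holds the URL.
$urls = array_column($matches, 2);
print_r($urls);
```

Because . does not match a newline by default, each match stays on one line. The pattern still breaks on URLs with query strings after .css, which is one of the flaws mentioned above.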
CodePudding user response:
Never parse HTML with regex. You will be frustrated eventually.
Use an HTML parser like https://github.com/paquettg/php-html-parser instead.
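To give an idea of the parser approach without extra dependencies, here is a sketch that uses PHP's built-in DOMDocument rather than the library linked above (its API is not shown here). It assumes the DOM extension is enabled, and the sample markup is made up.

```php
<?php
// Hypothetical sample page; the favicon link should be ignored.
$html = <<<'HTML'
<html><head>
<link rel="stylesheet" type="text/css" href="http://example.com/one.css">
<link rel='stylesheet' type='text/css' href='http://example.com/two.css'>
<link rel="icon" href="http://example.com/favicon.ico">
</head><body></body></html>
HTML;

// Collect libxml warnings internally instead of printing them,
// since real-world markup is often imperfect.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$cssUrls = [];
foreach ($dom->getElementsByTagName('link') as $link) {
    // rel can hold several space-separated tokens; keep stylesheet links only.
    $rel = preg_split('/\s+/', strtolower($link->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
    if (in_array('stylesheet', $rel, true)) {
        $cssUrls[] = $link->getAttribute('href');
    }
}
print_r($cssUrls);
```

Filtering on rel instead of the .css suffix also picks up stylesheets whose URL carries a query string or has no extension at all.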