Find all CSS URLs on a website using PHP and regex

I'm trying to extract all CSS file URLs from websites. Here is my code:

<?php
$re = '/(?<=href=[\'\"])[^"]+\.css/mi';
$str = file_get_contents('test.html');

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

It works for this: <link rel="stylesheet" type="text/css" href="http://example.com/mystyle.css"> and for this one: <link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>, but not for this:

<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>

The output is:

array(1) {
  [0]=>
  array(1) {
    [0]=>
    string(185) "http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css"
  }
}

CodePudding user response：

Your regex's issue is greediness. Change the + to +? so that it is not greedy; it will then stop at the first match rather than the last.

(?<=href=[\'"])[^"]+?\.css

Note that the double quote doesn't need to be escaped in the regex. The escaping you have done is only needed for PHP's string syntax, so inside a single-quoted string you only need to escape the single quote.
The negated character class in your pattern, [^"], would also be incorrect if a single quote were used as the delimiter.
The m modifier is not needed because your regex does not use anchors.

I think:

href=(["'])(.+?[.]css)\1

would be a bit better, but that too has flaws. A parser would be a better way to achieve this.

Index 2 will (should) have your URLs. https://regex101.com/r/MfBmMq/1/
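
As a rough sketch of how the backreference pattern could be used from PHP: the sample markup and the file names a.css, b.css and c.css are made up for illustration, and the i modifier is kept from the question. With PREG_SET_ORDER, each match is its own array and index 2 holds the URL.

```php
<?php
// Sample markup (an assumption for illustration), mixing both quote styles
// and spanning several lines, as in the question's failing case.
$html = <<<HTML
<link rel='stylesheet' type='text/css' href='http://example.com/a.css'>
<link rel="stylesheet" type="text/css" href="http://example.com/b.css">
<link rel='stylesheet' type='text/css' href='http://example.com/c.css'>
HTML;

// Group 1 captures the opening quote, group 2 the URL (lazy, so it stops at
// the first ".css"), and \1 requires the same quote character to close it.
$re = '/href=(["\'])(.+?[.]css)\1/i';

preg_match_all($re, $html, $matches, PREG_SET_ORDER);

// Collect index 2 (the URL) from every match.
$urls = array_column($matches, 2);

print_r($urls);
```

This should list the three URLs separately instead of one long string. It still has the flaws mentioned above: for example, it would miss stylesheet URLs that carry a query string after .css.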

CodePudding user response：

Never parse HTML with regex. You will be frustrated eventually.

Use an HTML parser like https://github.com/paquettg/php-html-parser instead.
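
For a sense of what the parser approach looks like, here is a minimal sketch. Note that it uses PHP's bundled DOM extension (DOMDocument) rather than the library linked above, since that library's API is not shown in this thread. The sample markup is an assumption for illustration.

```php
<?php
// Sample markup (an assumption for illustration).
$html = <<<HTML
<html><head>
<link rel='stylesheet' type='text/css' href='http://example.com/a.css'>
<link rel="icon" href="http://example.com/favicon.ico">
<link rel="stylesheet" href="http://example.com/b.css?v=2">
</head><body></body></html>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings internally; real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$cssUrls = [];
foreach ($doc->getElementsByTagName('link') as $link) {
    // rel may hold several space-separated tokens, e.g. "alternate stylesheet".
    $rels = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
    if (in_array('stylesheet', $rels, true)) {
        $cssUrls[] = $link->getAttribute('href');
    }
}

print_r($cssUrls);
```

Filtering on rel rather than on the file extension means the favicon link is skipped and the URL with a query string is still picked up, regardless of which quote style the page uses.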