Find all CSS URLs on a website using PHP and regex

Time:11-28

I'm trying to extract all CSS file URLs from websites. Here is my code:

<?php
$re = '/(?<=href=[\'\"])[^"]+\.css/mi';
$str = file_get_contents('test.html');

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

It works for this: <link rel="stylesheet" type="text/css" href="http://example.com/mystyle.css"> and for this: <link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>, but not for this:

<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>

The output is:

array(1) {
  [0]=>
  array(1) {
    [0]=>
    string(185) "http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/mystyle.css"
  }
}

CodePudding user response:

The issue with your regex is greediness. Change the + to +? so the quantifier is lazy; it will then stop at the first .css after each href rather than the last one in the input.

(?<=href=[\'"])[^"]+?\.css
  • Note that quotes do not need to be escaped in a regex. The escaping you have done is only for the PHP string literal, so in a single-quoted string you only need to escape the single quote.
  • The negated character class [^"] excludes only double quotes, so it does not stop at the closing quote when the attribute is wrapped in single quotes.
  • The m modifier is not needed because your regex is not using anchors.
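
To see the difference between the greedy and lazy quantifier, here is a minimal sketch. The sample HTML and the file names a.css, b.css and c.css are made up for illustration; only the two patterns come from the thread.

```php
<?php
// Assumed sample input: three single-quoted link tags, similar to the question.
$html = <<<'HTML'
<link rel='stylesheet' type='text/css' href='http://example.com/a.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/b.css'>
<link rel='stylesheet' type='text/css' href='http://example.com/c.css'>
HTML;

// Greedy: with no double quote in the input, [^"]+ runs to the end and
// backtracks to the LAST ".css", so all three tags collapse into one match.
$greedy = '/(?<=href=[\'"])[^"]+\.css/i';

// Lazy: +? expands only until the FIRST ".css" after each href.
$lazy = '/(?<=href=[\'"])[^"]+?\.css/i';

preg_match_all($greedy, $html, $greedyMatches);
preg_match_all($lazy, $html, $lazyMatches);

echo count($greedyMatches[0]), "\n"; // 1
echo count($lazyMatches[0]), "\n";   // 3
print_r($lazyMatches[0]);
```

The lazy version returns the three URLs separately, while the greedy version returns one long string spanning all three tags.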

I think:

href=(["'])(.+?[.]css)\1

would be a bit better but that too has flaws. A parser would be a better way to achieve this.

Index 2 will (should) contain your URLs. https://regex101.com/r/MfBmMq/1/
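
A rough sketch of how that pattern could be used from PHP, keeping the PREG_SET_ORDER flag from the question. The sample HTML and the file names one.css and two.css are assumptions for illustration.

```php
<?php
// Assumed sample input mixing double- and single-quoted attributes.
$html = '<link rel="stylesheet" href="http://example.com/one.css">' . "\n"
      . "<link rel='stylesheet' href='http://example.com/two.css'>";

// Group 1 captures the opening quote; \1 requires the same quote to close.
// Group 2 captures the URL up to the first ".css" followed by that quote.
$re = '/href=(["\'])(.+?[.]css)\1/i';

preg_match_all($re, $html, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER each element of $matches is one hit,
// so column 2 of every hit is the captured URL.
$urls = array_column($matches, 2);
print_r($urls);
```

One of the flaws: a URL with a query string, such as mystyle.css?ver=2, is not matched, because the closing quote has to follow .css directly. The pattern also matches any href ending in .css, not only those on link tags.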

CodePudding user response:

Never parse HTML with regex. You will be frustrated eventually.

Use an HTML parser such as https://github.com/paquettg/php-html-parser instead.
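
As a sketch of the parser approach, the example below uses PHP's bundled DOM extension (DOMDocument) rather than the linked library, since that library's API is not shown in this thread. The function name findStylesheetUrls and the sample HTML are hypothetical, and the code assumes the dom extension is enabled.

```php
<?php
// Hypothetical helper: returns the href of every <link> whose rel includes "stylesheet".
function findStylesheetUrls(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel may hold several space-separated tokens, e.g. "alternate stylesheet".
        $rel = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        $href = $link->getAttribute('href');
        if (in_array('stylesheet', $rel, true) && $href !== '') {
            $urls[] = $href;
        }
    }
    return $urls;
}

// Assumed sample input: two stylesheets (one with a query string) and an icon.
$html = "<html><head>"
      . "<link rel='stylesheet' type='text/css' href='http://example.com/a.css'>\n"
      . "<link rel=\"stylesheet\" href=\"http://example.com/b.css?v=2\">\n"
      . "<link rel='icon' href='/favicon.ico'>"
      . "</head><body></body></html>";

print_r(findStylesheetUrls($html));
```

Because the parser handles the quoting, either quote style works, and filtering on rel instead of the file extension also picks up stylesheet URLs that carry a query string.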
