preg_match_all to get div content from a forum

For years I played « Ouverture Facile », an old Flash game.

Since Flash was discontinued in 2020, I have been converting it to HTML / JS / CSS.

Swan, the author of the game, lost the SQL database behind the original forum board. That board is needed to solve some riddles. There is a « backup » on the Wayback Machine, but I have problems retrieving some parts of it.

Here’s my code:

// username
    $recherche = "<span class=\"normalname\">(.*)<\/span>";
    preg_match_all("/$recherche/i",$page, $user);

// date / time
    $recherche = "<span class=\"postdetails\">(.*)<\/span>";
    preg_match_all("/$recherche/i",$page, $dateheure);

// message
    $recherche = "<div class=\"postcolor\" id=\"post-(.*)\">(.*)<\/div>";
    preg_match_all("/$recherche/i",$page, $texte);

I then use a loop to display / store everything I need in a new database. However, this doesn’t work for all messages. For example, on this page: https://web.archive.org/web/20080614030522/http://www.ouverturefacile.com/forums/index.php?showtopic=2

I get 4 usernames and 4 dates / times, but only 3 messages! My script can’t find this one: « J’arrive pas à passer ce niveau… :cry: ». It looks like preg_match_all stops at a line break. Can someone help me, please? Thanks :)

Note 1: I have permission from the original author of the game.

Note 2: I know that preg_match_all is slow, but it will only be used once, from a private server without limits.

CodePudding user response：

You will want to read up on PCRE modifiers for regex patterns. In this case, adding the modifiers s (dot also matches newlines) and U (ungreedy) gives the results you are looking for.

s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded.

U (PCRE_UNGREEDY) This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?.

Without s, a message that spans multiple lines is not matched. If you added only s, the regex engine would consume everything from the first opening tag to the last </div> further down the markup. With the U flag in place, it matches only up to the first closing div tag, which marks the end of your forum post (as long as the post itself contains no nested div).
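To make the effect of the two modifiers concrete, here is a minimal sketch that runs the message pattern from the question with and without them. The $page markup is a made-up sample (not the real archived page), and the names $withoutFlags, $withFlags and $texteAvant are only there for illustration.

```php
<?php
// Hypothetical sample markup: the second post spans several lines,
// like the message the original script could not find.
$page = <<<HTML
<div class="postcolor" id="post-1">First message</div>
<div class="postcolor" id="post-2">
J'arrive pas à passer ce niveau… :cry:
</div>
HTML;

// Same pattern as in the question.
$recherche = '<div class="postcolor" id="post-(.*)">(.*)<\/div>';

// Only the "i" modifier: "." does not match newlines,
// so the multi-line post is skipped.
$withoutFlags = preg_match_all("/$recherche/i", $page, $texteAvant);

// "s" lets "." match newlines, "U" makes the quantifiers ungreedy.
$withFlags = preg_match_all("/$recherche/isU", $page, $texte);

echo "Without s/U: $withoutFlags match(es)\n";
echo "With s/U: $withFlags match(es)\n";

// $texte[1] holds the post ids, $texte[2] the message bodies.
foreach ($texte[2] as $i => $message) {
    echo $texte[1][$i], ' => ', trim($message), "\n";
}
```

The same isU suffix can be put on the username and date patterns too, although those spans normally sit on a single line.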

CodePudding user response：

Maybe instead of writing a regex, you should just use this HTML parser: https://github.com/simplehtmldom/simplehtmldom#usage
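As a rough sketch of the same parser-based idea, the snippet below uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than simplehtmldom, so it is a substitution and not that library's API. The sample markup, user names and dates are invented; only the class names come from the question.

```php
<?php
// Hypothetical sample markup reusing the class names from the question.
$page = <<<HTML
<html><body>
<span class="normalname">Alice</span>
<span class="postdetails">Jun 14 2008, 03:05</span>
<div class="postcolor" id="post-1">First message</div>
<span class="normalname">Bob</span>
<span class="postdetails">Jun 14 2008, 03:10</span>
<div class="postcolor" id="post-2">
Second message,<br />
on two lines
</div>
</body></html>
HTML;

// Old forum markup is rarely valid, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Collect the trimmed text of every node matched by an XPath query.
function collectTexts(DOMXPath $xpath, string $query): array
{
    $result = [];
    foreach ($xpath->query($query) as $node) {
        $result[] = trim($node->textContent);
    }
    return $result;
}

$users    = collectTexts($xpath, '//span[@class="normalname"]');
$dates    = collectTexts($xpath, '//span[@class="postdetails"]');
$messages = collectTexts($xpath, '//div[@class="postcolor"]');

// Line breaks inside a post are no longer a problem.
foreach ($messages as $i => $message) {
    echo $users[$i], ' (', $dates[$i], '): ', $message, "\n";
}
```

A parser copes with line breaks and nested tags without any modifiers. For pages with accented characters, loadHTML may need an explicit encoding hint, which this sketch leaves out.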