For years I played « Ouverture Facile », an old Flash game.
Since Flash was discontinued in 2020, I have been converting it to HTML / JS / CSS.
Swan, the author of the game, lost the SQL database of the forum where the original board was hosted. That board is needed to solve some of the riddles. There is a « backup » on the Wayback Machine, but I have problems extracting some parts of it.
Here's my code:
// username
$recherche = "<span class=\"normalname\">(.*)<\/span>";
preg_match_all("/$recherche/i",$page, $user);
// date / time
$recherche = "<span class=\"postdetails\">(.*)<\/span>";
preg_match_all("/$recherche/i",$page, $dateheure);
// message
$recherche = "<div class=\"postcolor\" id=\"post-(.*)\">(.*)<\/div>";
preg_match_all("/$recherche/i",$page, $texte);
I then use a loop to display / store everything I need in a new database. However, this doesn't work for all messages. For example, with this link: https://web.archive.org/web/20080614030522/http://www.ouverturefacile.com/forums/index.php?showtopic=2
I get 4 usernames and 4 dates / times, but only 3 messages! This one: « J’arrive pas à passer ce niveau… :cry: » is not found by my script. It looks like preg_match_all stops at a line break. Can someone help me, please? Thanks :)
Note 1: I have permission from the original author of the game.
Note 2: I know that preg_match_all is slow, but it is only going to be used once, from a private server without limits.
CodePudding user response:
You will want to read up on PCRE modifiers for regex patterns. In this case, if you add the modifiers s (dot also matches newlines) and U (ungreedy), you will get the results you are looking for.
s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded.
U (PCRE_UNGREEDY) This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?.
Without s, the dot cannot cross a line break, so posts that span multiple lines are not matched. If you only added s, the regex engine would munch up everything from the opening tag to the last </div> further down the markup. With the U flag in place, it only matches up to the first closing div tag, which marks the end of your forum post.
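To make the effect of the two modifiers concrete, here is a minimal sketch that runs the message pattern from the question with and without them. The $page content is a made-up two-post sample (an assumption, not the real archived page), and $m1 is just a throwaway variable name.

```php
<?php
// Hypothetical sample markup; the real forum page will differ.
$page = <<<HTML
<div class="postcolor" id="post-1">First post</div>
<div class="postcolor" id="post-2">J'arrive pas a passer ce niveau...
:cry:</div>
HTML;

// Same pattern as in the question.
$recherche = "<div class=\"postcolor\" id=\"post-(.*)\">(.*)<\/div>";

// Original modifier only: the dot stops at newlines, so post-2 is missed.
$without = preg_match_all("/$recherche/i", $page, $m1);

// s lets the dot cross newlines; U makes each (.*) stop at the first
// position where the rest of the pattern can match.
$with = preg_match_all("/$recherche/isU", $page, $texte);

echo "without s and U: $without message(s)\n";
echo "with s and U:    $with message(s)\n";
print_r($texte[1]); // post ids
print_r($texte[2]); // message bodies
```

On this sample the first call finds 1 message and the second finds 2. Note that U only helps as long as a post does not itself contain a nested div; in that case the match would end at the inner closing tag.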
CodePudding user response:
Instead of writing a regex, you could use an HTML parser such as this one: https://github.com/simplehtmldom/simplehtmldom#usage
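For a rough idea of what the parser approach looks like, here is a sketch that uses PHP's built-in DOMDocument and DOMXPath rather than simplehtmldom, so it runs without installing anything. The sample markup and the $posts array are assumptions for illustration; only the postcolor class name and the post- id prefix come from the question.

```php
<?php
// Hypothetical sample markup modeled on the class names in the question.
$page = <<<HTML
<html><body>
<span class="normalname">SomeUser</span>
<div class="postcolor" id="post-2">Line one
line two</div>
</body></html>
HTML;

$dom = new DOMDocument();
// Archived forum HTML is rarely valid; collect parser warnings quietly.
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$posts = [];
// Select every message container; line breaks inside it are no problem.
foreach ($xpath->query('//div[@class="postcolor"]') as $div) {
    $posts[$div->getAttribute('id')] = trim($div->textContent);
}

print_r($posts);
```

Usernames and dates could be collected the same way with queries on the normalname and postdetails spans. Pages with accented characters may need extra care with the character encoding when loading.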