Encoding issues lead to 2 folders created with same name in the same location

Time:11-02

I believe encoding issues are resulting in the creation of folders with the same name in the same location.

The first folder is created directly from the browser using a PHP backend. The second is created using the same PHP backend, but the request comes from an iOS app.

The logic is rather simple: I send the server a folder name. If the folder doesn't exist, the server creates it and puts a file inside; otherwise it just puts the file inside the existing folder.

Below is a screenshot of how it looks on the server: two folders with the same name, "Equipe Rouge", exist in the same location.

[Screenshot of the server directory listing]

A strange finding: if I send the request using Postman on Windows, the existing folder is detected and the file is put inside without issue. If I use Postman on Mac, however, I get the same issue: the folder is not detected as existing and is created again.

Here is how I retrieve the folder name on the PHP side:

$name = $_POST["name"];

Answer:

Apply Normalizer::normalize (procedural form: normalizer_normalize) to the folder name so that both spellings of "É" compare equal:

<?php
// Requires the intl extension.
$char_E_acute  = "\xC3\x89";  // 'Latin Capital Letter E With Acute' (U+00C9)
$chars_E_acute = "\x45"       // 'Latin Capital Letter E' (U+0045)
               . "\xCC\x81";  // 'Combining Acute Accent' (U+0301)
var_dump( $char_E_acute );
var_dump( $chars_E_acute );
// The raw byte sequences differ, so a plain comparison fails.
var_dump( $char_E_acute == $chars_E_acute );
// Decomposing the first string (NFD) yields the second.
var_dump( Normalizer::normalize( $char_E_acute, Normalizer::FORM_D )
       == $chars_E_acute );
// Composing the second string (NFC) yields the first.
var_dump( $char_E_acute
       == Normalizer::normalize( $chars_E_acute, Normalizer::FORM_C ) );
?>

Output (in a simple terminal such as the Windows command line, cmd, the second line may display as string(3) "E´", with the accent rendered separately from the letter):

string(2) "É"
string(3) "É"
bool(false)
bool(true)
bool(true)

For the theory, see Unicode Normalization Forms (Unicode Standard Annex #15).
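Applied to the folder logic from the question, one option is to normalize the incoming name to NFC before checking whether the folder exists. The following is only a sketch under assumptions: normalizedFolderName, ensureFolder and $uploadRoot are hypothetical names that do not appear in the question, and the intl extension is assumed to be installed.

```php
<?php
// Hypothetical helper: normalize a client-supplied folder name to NFC so that
// the precomposed and the decomposed spelling map to the same path.
function normalizedFolderName(string $raw): string
{
    $nfc = Normalizer::normalize($raw, Normalizer::FORM_C);
    if ($nfc === false) {
        // normalize() returns false on failure, e.g. for invalid UTF-8 input
        throw new InvalidArgumentException('Folder name is not valid UTF-8');
    }
    return $nfc;
}

// Hypothetical helper: create the folder only if it does not exist yet,
// and return its path either way.
function ensureFolder(string $baseDir, string $rawName): string
{
    $path = $baseDir . DIRECTORY_SEPARATOR . normalizedFolderName($rawName);
    if (!is_dir($path)) {
        mkdir($path, 0775, true);
    }
    return $path;
}

// Usage sketch ($uploadRoot is an assumed base directory):
// $folder = ensureFolder($uploadRoot, $_POST["name"]);
```

Folders that were already created under a decomposed name would still have to be renamed or merged once, since normalizing new requests does not change what is on disk.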

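FYI, the screenshot shows a mojibake case: in "EÌquipe Rouge", the "Ì" is the cp1252 interpretation of 0xCC, the first of the two UTF-8 bytes (0xCC 0x81) of the Combining Acute Accent. The second byte, 0x81, has no printable character in cp1252.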
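To check which form a given client actually sends, it can help to look at the raw bytes instead of the rendered text. A minimal sketch, assuming the intl extension is available; the two sample strings are made up for illustration:

```php
<?php
// Decomposed spelling: "E" followed by Combining Acute Accent (U+0301)
$decomposed = "E\xCC\x81quipe";
// Precomposed spelling: Latin Capital Letter E With Acute (U+00C9)
$composed = "\xC3\x89quipe";

// bin2hex() exposes the raw bytes regardless of how a terminal renders them
echo bin2hex($decomposed), "\n"; // 45cc817175697065
echo bin2hex($composed), "\n";   // c3897175697065

// isNormalized() reports whether a string is already in the given form
var_dump(Normalizer::isNormalized($decomposed, Normalizer::FORM_C)); // bool(false)
var_dump(Normalizer::isNormalized($composed, Normalizer::FORM_C));   // bool(true)
```

Logging bin2hex($_POST["name"]) for a request from each client would show whether the Mac and iOS clients send the decomposed form.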
