Removing printing characters from a string using PHP

Time:10-16

I am trying to remove some strange printing characters that appear in several files. The contents of these files have been pulled into a PHP string.

I have tried using preg_replace to remove the strange printing characters, but haven't had much success.

The strange part is that the regex I used with preg_replace does seem to work when I test it in a web-based regex tester, so I am confused as to why the same regex doesn't work in my PHP file.

The input data is just over 2000 lines. Below is a snippet of it showing the þ that I want to remove, along with the $NoCode line.

$800C5304 0063
$800C5306 0063
$800C5308 0063
$800C530A 0063
$800C530C 0063
$800C530E 0063
$800C5310 0063
$800C5312 0063
$800C5314 0063
$800C5316 0063
$800C5318 0063
$800C531A 0063
$800C531C 0063
þ
$NoCode

This is the regex I have tried with preg_replace

$fileData = preg_replace("/\$([A-F0-9]+) ([A-F0-9]+)\n(.+)\n\$NoCode/", "'\$$1 $2'", $fileData);
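
A possible explanation for the tester/PHP mismatch, offered as a guess rather than something this thread confirms: inside a double-quoted PHP string, \$ is itself an escape sequence for a literal $, so the backslash never reaches the regex engine and PCRE reads the bare $ as an end-of-subject anchor. Writing the pattern in single quotes keeps the backslash. The sketch below uses made-up sample data, assumes the stray character is the single byte 0xFE, and reshapes the pattern slightly (the whole address line sits in one capture group) to show the difference.

```php
<?php
// Hypothetical sample mirroring the snippet above; "\xFE" stands in for the
// stray byte (an assumption about what the file really contains).
$fileData = "\$800C531A 0063\n\$800C531C 0063\n\xFE\n\$NoCode\n";

// Double quotes: PHP turns \$ into $, so PCRE receives /$NoCode/ (an anchor).
$doubleQuoted = "/\$NoCode/";
// Single quotes: the backslash survives, so PCRE receives /\$NoCode/ (a literal $).
$singleQuoted = '/\$NoCode/';

$matchesDouble = preg_match($doubleQuoted, $fileData);
$matchesSingle = preg_match($singleQuoted, $fileData);

// Same idea applied to the clean-up: keep the address line, drop the two
// lines that follow it. Without the /u modifier, "." matches raw bytes.
$pattern = '/(\$[A-F0-9]+ [A-F0-9]+)\n.+\n\$NoCode/';
$cleaned = preg_replace($pattern, '$1', $fileData);
```

Online testers usually receive the pattern exactly as typed, with no PHP string parsing in between, which would account for the different behaviour.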

According to the link below, the þ seems to be, or at least to be part of, a UTF-16 byte order mark.

Remove ÿþ from string

When I run iconv(mb_detect_encoding($fileData), 'UTF-8', $fileData); I get:

iconv(): Detected an illegal character in input string.

If I do iconv('UTF-16', 'UTF-8', $fileData) instead I get:

iconv(): Detected an incomplete multibyte character in input

CodePudding user response:

So it seems the þ was an incomplete multibyte character. I fixed this with the command below, which replaces invalid multibyte sequences.

$fileData = mb_convert_encoding($fileData, 'UTF-8', 'UTF-8');

This left a ? where the þ originally was. I then removed it using the following.

$fileData = str_replace("\n?\n\$NoCode", '', $fileData);
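
The two steps above can be put together as a small self-contained sketch. The sample data is made up, and "\xFE" is only an assumption about which byte shows up as þ. The substitute character is set explicitly to "?" (0x3F, the usual mbstring default) so that the second step does not depend on configuration.

```php
<?php
// Hypothetical sample: "\xFE" is assumed to be the stray byte shown as þ.
$fileData = "\$800C531C 0063\n\xFE\n\$NoCode\n";

// Make the substitute character explicit ("?" is 0x3F).
mb_substitute_character(0x3F);

// Step 1: re-encode UTF-8 to UTF-8 so that invalid bytes become "?".
$fileData = mb_convert_encoding($fileData, 'UTF-8', 'UTF-8');

// Step 2: drop the "?" line together with the $NoCode marker line.
$fileData = str_replace("\n?\n\$NoCode", '', $fileData);
```

Note that the second step would also remove a genuine line consisting only of "?" if one happened to sit directly before $NoCode.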

CodePudding user response:

str_replace should be faster than preg_replace. Here is an example:

$input = file_get_contents('input.txt');
$output = str_replace(['þ','$NoCode'], '', $input);
file_put_contents('output.txt', $output);

Or, if you want to get rid of the empty lines too:

$input = file_get_contents('input.txt');
$output = str_replace(["þ\r\n","\$NoCode\r\n", "þ\n","\$NoCode\n", "þ\r", "\$NoCode\r"], '', $input);
file_put_contents('output.txt', $output);
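
One caveat, stated as an assumption about the input: if the file contains the single raw byte 0xFE rather than the UTF-8 encoding of þ (bytes 0xC3 0xBE), a 'þ' literal typed into a UTF-8 source file will not match it. A minimal sketch of the same str_replace approach with the byte spelled out, using made-up sample data:

```php
<?php
// Hypothetical sample with Windows line endings; "\xFE" is the assumed stray byte.
$input = "\$800C531C 0063\r\n\xFE\r\n\$NoCode\r\n";

// Build the search list: each token followed by each line ending.
// "\r\n" comes first so a lone "\n" is not left behind by the "\r" needle.
$needles = [];
foreach (["\xFE", '$NoCode'] as $token) {
    foreach (["\r\n", "\n", "\r"] as $eol) {
        $needles[] = $token . $eol;
    }
}

// str_replace applies the needles in order, removing each line it finds.
$output = str_replace($needles, '', $input);
```

Because str_replace works on bytes, this version does not care whether the rest of the file is valid UTF-8.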