How to clean a dirty CSV string using PHP regex

Time:01-18

My string may look like this:

@ *lorem.jpg,,, ip sum.jpg,dolor ..jpg,-/ ?

In fact, it is a dirty CSV string containing the names of JPG images.

I need to remove any non-alphanumeric characters from both ends of the string,
then, inside the resulting string, remove the same characters except commas and dots,
then collapse duplicate commas and dots, if any, into single ones.

So the final result should be:
lorem.jpg,ipsum.jpg,dolor.jpg

I first tried to remove all whitespace, anywhere in the string:

$str = str_replace(" ", "", $str);

Then I used various forms of the trim functions, but that is tedious and takes a lot of code.

An additional problem is that duplicate commas and dots may occur in runs of two or more, for example .. or ,,,,

Is there a way to solve this using regex, please?

CodePudding user response:

Here are the steps, modeled on your description:

Step 1

  • "remove any non-alphanum chars from both sides of the string"

  • translated: remove leading and trailing consecutive [^a-zA-Z0-9] characters

  • regex: replace ^[^a-zA-Z0-9]*(.*?)[^a-zA-Z0-9]*$ with $1

Step 2

  • "inside the resulting string - remove the same - except commas and dots"
  • translated: remove any [^a-zA-Z0-9.,]
  • regex: replace [^a-zA-Z0-9.,] with empty string

Step 3

  • "remove duplicates commas and dots - if any - replace them with single ones"
  • translated: replace each run of consecutive dots or commas with a single instance
  • regex: replace (\.{2,}) with .
  • regex: replace (,{2,}) with ,

PHP Demo:

https://onlinephp.io/c/512e1

<?php

$subject = " @ *lorem.jpg,,, ip sum.jpg,dolor ..jpg,-/ ?";

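// Step 1: strip non-alphanumeric characters from both ends
$firstStep = preg_replace('/^[^a-zA-Z0-9]*(.*?)[^a-zA-Z0-9]*$/', '$1', $subject);
// Step 2: remove everything except alphanumerics, dots and commas
$secondStep = preg_replace('/[^a-zA-Z0-9.,]/', '', $firstStep);
// Step 3: collapse runs of dots, then runs of commas
$thirdStepA = preg_replace('/\.{2,}/', '.', $secondStep);
$thirdStepB = preg_replace('/,{2,}/', ',', $thirdStepA);

echo $thirdStepB; // lorem.jpg,ipsum.jpg,dolor.jpg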
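
The same three steps can also be folded into one call, since preg_replace accepts arrays of patterns and replacements and applies them in order. A minimal sketch, assuming a hypothetical helper name cleanCsvNames (not part of the question) and a back-reference to collapse dots and commas in one pattern:

```php
<?php

// Hypothetical helper: applies the three steps in a single preg_replace call.
function cleanCsvNames(string $dirty): string
{
    $patterns = [
        '/^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$/', // step 1: strip non-alphanumerics at both ends
        '/[^a-zA-Z0-9.,]/',                // step 2: drop everything except alphanumerics, dots, commas
        '/([.,])\1+/',                     // step 3: collapse runs of the same dot or comma
    ];
    $replacements = ['', '', '$1'];

    // preg_replace may return null on a regex error; fall back to an empty string.
    return preg_replace($patterns, $replacements, $dirty) ?? '';
}

echo cleanCsvNames(" @ *lorem.jpg,,, ip sum.jpg,dolor ..jpg,-/ ?"), PHP_EOL;
// lorem.jpg,ipsum.jpg,dolor.jpg
```

Like the step-by-step version, this only collapses runs of the same character; a mixed run such as ",." is left as it is.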

CodePudding user response:

Look at

https://www.php.net/manual/en/function.preg-replace.php

It replaces anything inside a string based on a pattern. \s stands for any whitespace character, but take care with NBSP (non-breaking space; \h matches it).

Example 4

$str = preg_replace('/\s\s+/', '', $str);

It will be something like that.
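
To illustrate the NBSP remark, here is a small sketch, assuming UTF-8 input and the u modifier, that strips horizontal whitespace including non-breaking spaces with \h. The sample string is made up for illustration.

```php
<?php

// Made-up sample: a regular space plus a UTF-8 non-breaking space (U+00A0).
$str = "ip sum\u{00A0}.jpg";

// With the u modifier the pattern and subject are treated as UTF-8,
// so \h matches U+00A0 as well as ordinary spaces and tabs.
$clean = preg_replace('/\h+/u', '', $str);

echo $clean, PHP_EOL; // ipsum.jpg
```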

CodePudding user response:

Can you try this:

$string = ' @ *lorem.jpg,,,,  ip sum.jpg,dolor .jpg,-/ ?';
// this keeps only alphanumerics, commas and dots
$result = preg_replace("/[^A-Za-z0-9,.]/", '', $string);

// this collapses repeated dots and commas
$result = preg_replace('/,+/', ',', $result);
$result = preg_replace('/\.+/', '.', $result);

// this removes , ; . and spaces from the end
$result = preg_replace("/[ ,;.]*$/", '', $result);
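
One caveat with this approach: only the end of the string is trimmed, so commas or dots at the very start survive the first replacement. A minimal sketch, assuming a made-up input that begins with stray commas and a dot, using trim() with a character list to clean both ends:

```php
<?php

// Hypothetical input that starts with stray commas and a dot.
$string = ',,. @ *lorem.jpg,,,,  ip sum.jpg,dolor .jpg,-/ ?';

$result = preg_replace('/[^A-Za-z0-9,.]/', '', $string); // keep alphanumerics, commas, dots
$result = preg_replace('/,+/', ',', $result);            // collapse repeated commas
$result = preg_replace('/\.+/', '.', $result);           // collapse repeated dots
$result = trim($result, ',.');                           // strip commas and dots at both ends

echo $result, PHP_EOL; // lorem.jpg,ipsum.jpg,dolor.jpg
```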

Tags: php