Extract SKU values which may be numeric or alphanumeric and must be 4 to 20 characters long

I am open to including more code than just a regular expression.

I am writing some code that takes a picture, runs a couple of Imagick filters and then a Tesseract OCR pass, and outputs text.

From that text, I am using a regex with PHP to extract a SKU (model number for a product) and output the results into an array, which is then inserted into a table.

All is well, except that with the expression I'm using now:

\w[^a-z\s\/?!@#-$%^&*():;.,œ∑´®†¥¨ˆøπåß∂ƒ©˙∆˚¬Ω≈ç√∫˜µ≤≥]{4,20}

I will still get back some strings which contain ONLY letters.

The ultimate goal:

-strings that may contain uppercase letters and numbers,
-strings that contain only numbers,
-strings that do not contain only letters,
-strings which do not contain any lowercase letters,
-these strings must be between 4-20 characters

as an example:

a SKU could be 5209, or it could also be WRE5472UFG5621.

CodePudding user response：

Until the regex maestros show up, a lazy person such as myself would just do two rounds on this and keep it simple. First, match all strings that are only A-Z, 0-9 (rather than crafting massive no-lists or look-abouts). Then, use preg_grep() with the PREG_GREP_INVERT flag to remove all strings that are A-Z only. Finally, filter for unique matches to eliminate repeat noise.

$str = '-9 Cycles 3 Temperature Levels Steam Sanitizet  -Sensor Dry | ALSO AVAILABLE (PRICES MAY VARY) |- White - 1258843 - DVE45R6100W {  Platinum - 1501 525 - DVE45R6100P desirable: 1258843 DVE45R6100W';

$wanted = [];

// First round: Get all A-Z, 0-9 substrings (if any)
if(preg_match_all('~\b[A-Z0-9]{6,24}\b~', $str, $matches)) {

    // Second round: Filter all that are A-Z only
    $wanted = preg_grep('~^[A-Z]+$~', $matches[0], PREG_GREP_INVERT);

    // And remove duplicates:
    $wanted = array_unique($wanted);
}

Result:

array(3) {
    [2] => string(7) "1258843"
    [3] => string(11) "DVE45R6100W"
    [4] => string(11) "DVE45R6100P"
}

Note that I've increased the match length to {6,24} even though you speak of a 4-character match, since your sample string has 4-digit substrings that were not in your "desirable" list.

Edit: I've moved the preg_match_all() into a conditional construct containing the remaining ops, and set $wanted as an empty array by default. You can conveniently both capture matches and evaluate whether anything matched in one go (rather than, e.g., having if(!empty($matches))).
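If the two rounds are needed in more than one place, they could be wrapped in a small helper. This is only a sketch: the function name extractSkus() and its $min/$max parameters are made up for illustration, and the defaults follow the 4 to 20 length from the question rather than the {6,24} used above. The input string is a shortened, invented variant of the sample.

```php
<?php
// Hypothetical helper: extractSkus() and its parameters are illustrative names,
// not part of PHP or of the original question.
function extractSkus(string $text, int $min = 4, int $max = 20): array
{
    // First round: whole "words" made only of A-Z and 0-9, within the length bounds
    $pattern = sprintf('~\b[A-Z0-9]{%d,%d}\b~', $min, $max);
    if (!preg_match_all($pattern, $text, $matches)) {
        return [];
    }

    // Second round: drop the strings that are letters only
    $mixed = preg_grep('~^[A-Z]+$~', $matches[0], PREG_GREP_INVERT);

    // Remove duplicates and reindex from 0
    return array_values(array_unique($mixed));
}

$skus = extractSkus('White - 1258843 - DVE45R6100W { Platinum - 1501 525 - DVE45R6100P ALSO 1258843');
var_export($skus);
```

With the default bounds, short numeric strings such as 1501 are kept; passing 6 and 24 as the second and third arguments should reproduce the stricter length used in the snippet above.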

CodePudding user response：

Okay, you have accepted a suboptimal answer since I asked for question improvement in a comment under the question. I'll interpret this to mean that you have no intention of clarifying the question further and that the other answer works as desired. For this reason, I'll offer a single regex solution so that you don't need to use iterated regex filtering after making an initial regex extraction.

For your limited sample data, your requirement boils down to:

Match whole "words" (visible characters separated by spaces) which:

-consist of numeric or alphanumeric strings, and
-are between 4 and 20 characters long.

You can subsequently eliminate duplicated matched strings with array_unique() if desirable.

Code:

$str = '-9 Cycles 3 Temperature Levels Steam Sanitizet  -Sensor Dry | ALSO AVAILABLE (PRICES MAY VARY) |- White - 1258843 - DVE45R6100W {  Platinum - 1501 525 - DVE45R6100P desirable: 1258843 DVE45R6100W';

if (preg_match_all('~\b(?=\S*\d)[A-Z\d]{4,20}\b~', $str, $m)) {
    var_export(array_unique($m[0]));
}

Output:

array (
  0 => '1258843',
  1 => 'DVE45R6100W',
  2 => '1501',
  3 => 'DVE45R6100P',
)

Pattern Breakdown:

\b             #the zero-width position between a character matched by \W and a character matched by \w
(?=\S*\d)      #lookahead to ensure that the following "word" contains at least 1 digit
[A-Z\d]{4,20}  #match between 4 and 20 digits or uppercase letters
\b             #the zero-width position between a character matched by \W and a character matched by \w
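One caveat with the lookahead: \S* can run past punctuation, so in a token such as ABCD-1 the letters-only part ABCD still passes, because a digit appears later in the same run of non-space characters. The sample text has no such token, so the following is only a hedged variant in case the OCR output does. Restricting the lookahead to [A-Z]*\d requires the digit to sit inside the matched word itself. The input string is invented for illustration.

```php
<?php
// Invented input: "ABCD-1" is a letters-only word joined to a digit by a hyphen
$text = 'ABCD-1 WRE5472UFG5621 5209 ALSO';

// Original lookahead: \S* may run past the hyphen to find a digit
preg_match_all('~\b(?=\S*\d)[A-Z\d]{4,20}\b~', $text, $loose);

// Stricter lookahead: only uppercase letters may precede the required digit
preg_match_all('~\b(?=[A-Z]*\d)[A-Z\d]{4,20}\b~', $text, $strict);

var_export($loose[0]);   // includes 'ABCD'
var_export($strict[0]);  // 'ABCD' is no longer matched
```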

The equivalent non-regex process (which I do not endorse) is:

$result = [];  // initialise so array_values() below works even with no matches
foreach (explode(' ', $str) as $word) {
    $length = strlen($word);
    if ($length >= 4                    // has 4 characters or more
        && $length <= 20                // has 20 characters or less
        && !isset($result[$word])       // not yet in result array
        && ctype_alnum($word)           // comprised of numbers and/or letters only
        && !ctype_alpha($word)          // is not comprised solely of letters
        && $word === strtoupper($word)  // has no lowercase letters
    ) {
        $result[$word] = $word;
    }
}
var_export(array_values($result));
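Since the text comes from OCR, it may contain tabs or line breaks as well as spaces, and explode(' ', ...) only splits on single spaces. A possible variation of the loop above, assuming that kind of input, splits on any run of whitespace with preg_split() instead. The input string here is invented to show the difference.

```php
<?php
// Invented OCR-like input containing a tab and a line break
$str = "White - 1258843\tDVE45R6100W\nPlatinum - 1501 525 AVAILABLE 1258843";

$result = [];
// Split on any run of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces
foreach (preg_split('~\s+~', $str, -1, PREG_SPLIT_NO_EMPTY) as $word) {
    $length = strlen($word);
    if ($length >= 4
        && $length <= 20
        && ctype_alnum($word)           // letters and/or digits only
        && !ctype_alpha($word)          // not letters only
        && $word === strtoupper($word)  // no lowercase letters
    ) {
        $result[$word] = $word;         // keying by value removes duplicates
    }
}
var_export(array_values($result));
```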