How to differentiate these two strings with a REGEX?

I am trying to identify Spanish ID numbers using REGEX on MySQL. I took the regex below and adapted it to my dataset, because the IDs are not isolated and may not sit at the start or end of the string. The expressions are:

Original: ^(x?\d{8}|[xyz]\d{7})[trwagmyfpdxbnjzsqvhlcke]$

Mine: [0-9]{8,8}[A-Za-z]{1}

When I run the search using my REGEX, this is a sample of what I get:

GOOD --> 47099085T
GOOD --> D73654109H
NOT OK --> 8.30781719e-05
NOT OK --> 0113:11:19 00:54:17.042828927Z

How can I modify [0-9]{8,8}[A-Za-z]{1} to exclude the "NOT OK" items?

Spanish ID syntax:

The number of the National Identity Document consists of 8 digits and one check letter. The letter is found by taking all 8 digits as a single number and dividing it by 23. The remainder of this division, which is between 0 and 22, determines the letter. The letters I, Ñ, O and U are not used: I and O to avoid confusion with the digits 1 and 0, and Ñ to avoid confusion with N.

Remainder:      0    1    2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22
Letter:         T    R    W   A   G   M   Y   F   P   D   X   B   N   J   Z   S   Q   V   H   L   C   K   E

CodePudding user response：

Ok, according to documentation the Spanish ID system (DNI) is structured thus:

The number of the National Identity Document consists of 8 digits and one check letter. The letter is found by taking all 8 digits as a single number and dividing it by 23. The remainder of this division, which is between 0 and 22, determines the letter. The letters I, Ñ, O and U are not used: I and O to avoid confusion with the digits 1 and 0, and Ñ to avoid confusion with N.

Remainder:      0    1    2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22
Letter:         T    R    W   A   G   M   Y   F   P   D   X   B   N   J   Z   S   Q   V   H   L   C   K   E
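The remainder rule quoted above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the MySQL solution; the function names dni_check_letter and has_valid_check_letter are made up for this example.

```python
# Check letters indexed by remainder 0..22, copied from the table above.
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def dni_check_letter(digits: str) -> str:
    """Return the check letter for an 8-digit DNI number given as a string."""
    if len(digits) != 8 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly 8 ASCII digits")
    return DNI_LETTERS[int(digits) % 23]


def has_valid_check_letter(dni: str) -> bool:
    """True if `dni` is 8 digits followed by the letter the remainder rule predicts."""
    if len(dni) != 9:
        return False
    try:
        return dni_check_letter(dni[:8]) == dni[8].upper()
    except ValueError:
        return False


print(dni_check_letter("12345678"))  # Z, because 12345678 % 23 == 14
```

A regex can only check the shape of an ID (8 digits plus an allowed letter); confirming that the letter agrees with the digits needs arithmetic like this.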

After some exploration with negative lookaheads and completely failing to get them to work, we can take a more manual approach: explicitly check that the found block of 8 digits is not preceded by another digit or a decimal point:

/(^|[^.\d])\d{8}[TRWAGMYFPDXBNJZSQVHLCKE]/gmi

MySQL safe/syntax version:

(^|[^0-9.])([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])

Example usage with REGEXP_REPLACE (MySQL 8.0+) to return the rows where id_column matches the ID syntax; the replacement keeps only the second capture group (the ID itself) of each match:

    SELECT REGEXP_REPLACE(`id_column`,
        '(^|[^\\d.])(\\d{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$2') AS id_output
    FROM `table_name`
    WHERE `id_column` REGEXP '(^|[^\\d.])(\\d{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])';

NOTE: Prior to MySQL 8.0.17, the result returned by this function used the UTF-16 character set; in MySQL 8.0.17 and later, the character set and collation of the expression searched for matches is used. (Bug #94203, Bug #29308212)

This matches the two valid entries in your example and also checks that only one of the valid letters follows the digits.

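Note that the max value in the quantifier {min,max} is largely irrelevant here: {8,8} is the same as {8}, and it only limits how many digits the match itself consumes. It does not require that no more than max digits exist in the source string, so an unanchored pattern can still match 8 digits inside a longer run.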
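To make the quantifier point concrete, here is a small Python sketch (Python's re module standing in for MySQL's engine, which is an assumption of this example). It runs the pattern from the question against the timestamp sample, where nine digits precede the Z.

```python
import re

# The pattern from the question, and the same pattern with the redundant
# quantifiers simplified: {8,8} is {8}, and {1} is no quantifier at all.
loose = re.compile(r"[0-9]{8,8}[A-Za-z]{1}")
same = re.compile(r"[0-9]{8}[A-Za-z]")

sample = "0113:11:19 00:54:17.042828927Z"  # nine digits before the Z

# Neither pattern looks at what comes before the match, so both
# pick up the last eight of the nine digits plus the letter.
loose_hit = loose.search(sample).group(0)
same_hit = same.search(sample).group(0)
print(loose_hit, same_hit)
```

Both searches succeed, which is why the fix has to constrain the character before the digits rather than tweak the quantifier.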

What does my Regex do:

Checks that a set of 8 digits is not preceded by either another digit or a decimal point (so 9 digits are never "captured").
Checks that the set of 8 found digits is immediately followed by one of the valid letters of either case.
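The two checks can be tried outside the database with a short Python sketch. It uses the same pattern as the MySQL-safe version above; the helper name find_dni is hypothetical, and Python's re module is assumed to behave like MySQL's engine for this simple pattern.

```python
import re

# Group 1: start of string, or one character that is neither a digit nor a dot.
# Group 2: eight digits followed by one allowed check letter (either case).
DNI_PATTERN = re.compile(
    r"(^|[^0-9.])([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])"
)


def find_dni(text: str):
    """Return the first DNI-shaped substring in `text`, or None if there is none."""
    match = DNI_PATTERN.search(text)
    return match.group(2) if match else None


samples = [
    "47099085T",
    "D73654109H",
    "8.30781719e-05",
    "0113:11:19 00:54:17.042828927Z",
]
for s in samples:
    print(repr(s), "->", find_dni(s))
```

The ID is read from group 2 because group 1 may contain the extra character consumed in front of the digits (the D in the second sample).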

You can see my Regex in action here and the corresponding MySQL demo here.

47099085T                         // matches 
D73654109H                        // matches 
8.30781719e-05                    // unmatched 
0113:11:19 00:54:17.042828927Z  // unmatched
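For engines that do support lookbehind, the same "not preceded by a digit or a dot" check can be written without consuming the preceding character. The sketch below uses Python's re module; whether your MySQL version accepts this syntax is something to verify before relying on it. The name extract_ids is made up for the example.

```python
import re

# (?<![0-9.]) asserts that the character before the eight digits, if any,
# is neither a digit nor a dot, without making it part of the match.
LOOKBEHIND_PATTERN = re.compile(
    r"(?<![0-9.])[0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke]"
)


def extract_ids(text: str) -> list:
    """Return every DNI-shaped substring found in `text`."""
    return LOOKBEHIND_PATTERN.findall(text)


print(extract_ids("GOOD --> 47099085T and D73654109H"))
print(extract_ids("8.30781719e-05"))
print(extract_ids("0113:11:19 00:54:17.042828927Z"))
```

Because the lookbehind consumes nothing, the whole match is the ID and no capture group is needed.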