Regular expression for Swift X character set in Oracle-CodePudding

As per the swift documentation, The SWIFT X character set consists of the below

X Character Set – SWIFT Character Set

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
/ – ? : ( ) . , ‘   CrLf Space

I have come up with the below to validate the swift character set which seems to be working but want to know if there is a better way of doing it.Also what should I use for CRLF to be OS neutral.Since i use unix I have put chr(10)

^[a-zA-Z0-9 -?:().,'' chr(10)/]*$

CodePudding user response：

Unfortunately, a range like a-z may include accented letters and collation elements, depending on the value of nls_sort at the time of running a query. And, alas, Oracle does not support the character class [[:ascii:]], which would be another way to specify what you need.

You have two choices. Either you specify the nls_sort parameter explicitly every time, before running the query (or rely on it being something like 'English' already), which to me doesn't sound like a good practice; or you specify all letters explicitly.

There are a few more things to fix. The dash - has special meaning in a bracketed expression; if you want it to mean a literal dash, it should appear as either the first or the last character in the list, where it can't have its special meaning. All other regexp special characters are not special in a bracketed expression, so you don't need to worry about dot, question mark, asterisk, parentheses, etc.

However, note that the single-quote character, while it has no special meaning in a regular expression (in a bracketed expression or otherwise), it does have a special meaning in a string in Oracle; to include a single-quote in a hard-coded string, you must escape it by typing two single-quote characters.

Then - if you write chr(10) in a bracketed expression, that is characters c, h, ... - if you mean LF, you need to either actually include a newline character in your string (probably a bad idea), or concatenate it by hand.

And if you want to validate against the official character set of "swift x" (whatever that is), you should include all characters, regardless of your OS. So you should accept CR (chr(13)) too, unless you have a better reason to omit it. If it is present but you don't want it in your db, you should accept it and then remove it after the fact and save the resulting string (after you remove CR), not reject the entire string altogether.

To keep the work organized, I would create a very small table (or view) to store the needed validation string, then use it in all queries that need it.

Something like this:

create table swift_validation (validation_pattern varchar2(100));

insert into swift_validation (validation_pattern)
  with helper(ascii_letters) as (
    select 'abcdefghijklmnopqrstuvwxyz' from dual
  )
  select '^[' || ascii_letters             --  a-z (ASCII)
              || upper(ascii_letters)      --  A-Z (ASCII)
              || '0-9'
              || chr(10) || chr(13)        --  LF and CR
              || '/?:().,''  -'
              || ']*$'
  from   helper
;

commit;

Check what was saved in the table:

select * from swift_validation;

VALIDATION_PATTERN                                          
------------------------------------------------------------
^[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0-9

/?:().,'  -]*$

Note that the result is in three lines. chr(10) is seen as a newline; then chr(13) by itself is converted to another newline.

In any case, if you really want to see the exact characters saved in this string, you can use the dump function. With option 17 for the second argument, the output is readable (you will have to scroll though):

select dump(validation_pattern, 17) from swift_validation;

DUMP(VALIDATION_PATTERN,17)                                                                                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Typ=1 Len=73: ^,[,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,0,-,9,^J,^M,/,?,:,(,),.,,,', , ,-,],*,$

Notice in particular the control characters, ^J and ^M; they mean chr(10) and chr(13) respectively (easy to remember: J and M are the tenth and thirteenth letters of the Latin alphabet).

Then you use this as follows:

with
  test_strings (str) as (
    select 'abc   (12)'    from dual union all
    select '$122.38'       from dual union all
    select null            from dual union all
    select 'café au lait'  from dual union all
    select 'A / B - C * D' from dual
  )
select t.str,
       case when regexp_like(t.str, sv.validation_pattern)
            then 'valid' else 'invalid' end as swift_valid
from   test_strings t, swift_validation sv
;

STR            SWIFT_VALID
-------------  -----------
abc   (12)     valid  
$122.38        invalid
               invalid
café au lait   invalid
A / B - C * D  invalid

Notice one last oddity here. In my test, I included a row where the input string is empty (null). Regular expressions are odd in this respect: null is not (regexp_) like something like 'a*' - even though * is supposed to mean "zero or more ...". Oracle's reasoning, perhaps, is that null may be anything - just one of the hundreds of ways the Oracle identification of null and "empty string" is just plain idiotic. It is what it is though; make sure you don't reject a row with an empty string. I assume "swift x" allows empty strings. You will need to handle that separately, like this:

with
  test_strings (str) as (
    select 'abc   (12)'    from dual union all
    select '$122.38'       from dual union all
    select null            from dual union all
    select 'café au lait'  from dual union all
    select 'A / B - C * D' from dual
  )
select t.str,
       case when t.str is null
              or regexp_like(t.str, sv.validation_pattern)
            then 'valid' else 'invalid' end as swift_valid
from   test_strings t, swift_validation sv
;

STR            SWIFT_VALID
-------------  -----------
abc   (12)     valid  
$122.38        invalid
               valid
café au lait   invalid
A / B - C * D  invalid

Left as exercise:

You may need to find the invalid characters in an invalid string. For such generalized applications (more than a straight validation of a whole string), you might be better off saving just the bracketed expression in the swift_validation table (without the leading anchor ^, and the trailing quantifier * and anchor $). Then you need to re-write the validation query slightly, to concatenate these fragments to the validation pattern in the regexp_like condition; but then you can include, for example, an additional column to show the first invalid character in an invalid string.

EDIT

In follow-up discussion (see comments below this answer), the OP clarified that only the combination chr(13) || chr(10) (in that order) is permitted. chr(10) and chr(13) are invalid if they appear by themselves, or in the wrong order.

This makes the problem more interesting (more complicated). To allow only the letters a, b, c or the sequence xy (that is: x alone, or y alone, are not allowed; every x must appear followed immediately by y, and every y must appear immediately preceded by x), the proper matching pattern looks like

'^([abc]|xy)*$'

Here expr1|expr2 is alternation, and it needs to be enclosed in parentheses to apply the * quantifier.

An additional complication is that $ doesn't actually match "the end of the input string"; it anchors either at the end of the input string, or if the input string ends in newline (chr(10)), it anchors before that character. Happily, there is the alternative anchor \z that doesn't suffer from that defect; it anchors truly at the end of the input string. This will be needed if we don't want to validate input strings that end in chr(10) not preceded immediately by chr(13). (If we do want to allow those - even though technically they do violate the "swift x" rules - then replace \z with $ as we had it before).

Here I demonstrate a slightly modified approach - now the small table that stores the validation rule only contains the alternation bit - either one character out of an enumeration, or the two-character sequence chr(13) || chr(10)), letting the "caller" wrap this within whatever is needed for a complete matching pattern.

The small table (note that I changed the column name):

drop table swift_validation purge;

create table swift_validation (valid_patterns varchar2(100));

insert into swift_validation (valid_patterns)
  with helper(ascii_letters) as (
    select 'abcdefghijklmnopqrstuvwxyz' from dual
  )
  select    '['                           -- open bracketed expression
                || ascii_letters          --  a-z (ASCII)
                || upper(ascii_letters)   --  A-Z (ASCII)
                || '0-9'
                || '/?:().,''  -'         --  '' escape for ', - last
         || ']'                           -- close bracketed expression
         || '|'                           -- alternation
                || chr(13) || chr(10)     --  CR LF
  from   helper
;

commit;

Testing (notice the modified match pattern: now the ^ and \z anchors, the parentheses and the * quantifier are hard-coded in the query, not in the saved string):

with
  test_strings (id, str) as (
    select 1, 'abc   (12)'                from dual union all
    select 2, '$122.38'                   from dual union all
    select 3, null                        from dual union all
    select 4, 'no_underline'              from dual union all
    select 5, 'A / B - C * D'             from dual union all
    select 6, 'abc' || chr(10) || chr(13) from dual union all
    select 7, 'abc' || chr(10)            from dual union all
    select 8, 'abc' || chr(13) || chr(10) from dual union all
    select 9, 'café au lait'              from dual
  )
select t.id, t.str,
       case when t.str is null
              or regexp_like(t.str, '^(' || sv.valid_patterns || ')*\z')
            then 'valid' else 'invalid' end as swift_valid
from   test_strings t, swift_validation sv
;

ID  STR            SWIFT_VALID
--  -------------  -----------
 1  abc   (12)     valid      
 2  $122.38        invalid    
 3                 valid      
 4  no_underline   invalid    
 5  A / B - C * D  invalid    
 6  abc            invalid    
    
                           
 
 7  abc            invalid    
 
 8  abc            valid      

 9  café au lait   invalid

The newline characters (CR and LF) aren't clearly visible in the output; I added an id column so you can reference the output according to the input in the with clause.