Using regex to split a sting into multiple variables SAS-CodePudding

I have a question regarding the usage of regex in SAS.

My dataset looks like that:

ID	Code
101	K2K5K8F10F26F2
102	L7P13P4
103	L1

And I would like it to look like this:

ID	Code
101	K2
101	K5
101	K8
101	F10
101	F26
101	F2
102	L7
102	P13
102	P4
103	L1

At the beginning I thought that it is easier to do it first by assigning new columns and then by rows.

My attempt looks as follows:

proc ds2;
data Codes (overwrite=yes);
dcl char(16) code1 code2 code3 code4 code5 code6;
dcl double re;
keep code1 code2 code3 code4 code5 code6;
retain re;

 method init();
      dcl varchar(32) expression;
      expression = '/(\w \d ) /';
      re=prxparse(expression);
      if missing( re ) then do;
         put 'ERROR: Invalid expression ' expression;
         stop;
      end;
end;



method run();
      set mytable;
      code1 = 'ERROR';
      if prxmatch(re, Code) then
         do;
            code1=prxposn(re, 0, Code);
            code2=prxposn(re, 1, Code);
            code3=prxposn(re, 2, Code);
            code4=prxposn(re, 3, Code);
            code5=prxposn(re, 4, Code);
            code6=prxposn(re, 5, Code);
         end;
       else do;
           code1='0';
        end;
   end;
enddata;
run;
quit;

proc print data=Codes;
run;
quit;

However, nothing changes. As a result I get the code1 and code2 column to be populated exactly like the Code column from the initial dataset. I would really appreciate any help on this as regex is not my strong suit. I also put code1 = 'ERROR' and later on code1 = '0' in order to check if the code works.

It should be noted that I created up to code6 as a trial. I cannot know the exact number of codes per ID. However, I do know that the code must always be one letter in combination with either one or two digits and it can also be in the form Z12-9 (so one letter followed by two digits, followed by a dash, followed by one digit).

Thank you in advance!

CodePudding user response：

Answering the regular expression part of this, your regex is wrong, and I think prxposn is probably also wrong.

\w matches numerics as well as alpha, so \w will grab all of the string. You need to either use [A-Z], or use \w ? to use less aggressive matching to get only the single alpha-then-numeric set.

Also, the right method here is call prxnext, prxposn matches each bracket match in the regex, so 1 is the first one, 2 is the second one, but (something) only is one bracket match. call prxnext will keep finding more matches of a single match and you can use that to grab the matching bits.

Again here is in simple data step, but DS2 would be similar.

data want;
  set have;
  rx = prxparse('/[A-Z] \d /ios');
  
  start = 1;
  
  do until (pos eq 0);
    call prxnext(rx,start,length(code),code,pos,len);
    if pos gt 0 then do;
      w = substr(code,pos,len);
      put w=;
      output;
    end;
  end;
run;

CodePudding user response：

I find this a particularly good use case for call scan, regex isn't nearly as efficient. Here I use call scan to find the "word boundary" of the (always single) letter, then grab it plus whatever's before the next letter (or end-of-word).

data have;
length code $20 ;
input id code $;
datalines;
101 K2K5K8F10F26F2 
102 L7P13P4 
103 L1
;;;;
run;
data want;
  set have;
  do count = 1 to countw(code,,'a');
    call scan(code,count,pos,len,,'a');
    w = substr(code,pos-1,len 1);
    output;
  end;
run;

I think this would work in DS2 just as well as the data step if that's a requirement.