I have a question regarding the usage of regex in SAS.
My dataset looks like that:
ID | Code |
---|---|
101 | K2K5K8F10F26F2 |
102 | L7P13P4 |
103 | L1 |
And I would like it to look like this:
ID | Code |
---|---|
101 | K2 |
101 | K5 |
101 | K8 |
101 | F10 |
101 | F26 |
101 | F2 |
102 | L7 |
102 | P13 |
102 | P4 |
103 | L1 |
At the beginning I thought that it is easier to do it first by assigning new columns and then by rows.
My attempt looks as follows:
proc ds2;
data Codes (overwrite=yes);
dcl char(16) code1 code2 code3 code4 code5 code6;
dcl double re;
keep code1 code2 code3 code4 code5 code6;
retain re;
method init();
dcl varchar(32) expression;
expression = '/(\w \d ) /';
re=prxparse(expression);
if missing( re ) then do;
put 'ERROR: Invalid expression ' expression;
stop;
end;
end;
method run();
set mytable;
code1 = 'ERROR';
if prxmatch(re, Code) then
do;
code1=prxposn(re, 0, Code);
code2=prxposn(re, 1, Code);
code3=prxposn(re, 2, Code);
code4=prxposn(re, 3, Code);
code5=prxposn(re, 4, Code);
code6=prxposn(re, 5, Code);
end;
else do;
code1='0';
end;
end;
enddata;
run;
quit;
proc print data=Codes;
run;
quit;
However, nothing changes. As a result I get the code1 and code2 column to be populated exactly like the Code column from the initial dataset. I would really appreciate any help on this as regex is not my strong suit. I also put code1 = 'ERROR' and later on code1 = '0' in order to check if the code works.
It should be noted that I created up to code6 as a trial. I cannot know the exact number of codes per ID. However, I do know that the code must always be one letter in combination with either one or two digits and it can also be in the form Z12-9 (so one letter followed by two digits, followed by a dash, followed by one digit).
Thank you in advance!
CodePudding user response:
Answering the regular expression part of this, your regex is wrong, and I think prxposn
is probably also wrong.
\w
matches numerics as well as alpha, so \w
will grab all of the string. You need to either use [A-Z]
, or use \w ?
to use less aggressive matching to get only the single alpha-then-numeric set.
Also, the right method here is call prxnext
, prxposn
matches each bracket match in the regex, so 1 is the first one, 2 is the second one, but (something)
only is one bracket match. call prxnext
will keep finding more matches of a single match and you can use that to grab the matching bits.
Again here is in simple data step, but DS2 would be similar.
data want;
set have;
rx = prxparse('/[A-Z] \d /ios');
start = 1;
do until (pos eq 0);
call prxnext(rx,start,length(code),code,pos,len);
if pos gt 0 then do;
w = substr(code,pos,len);
put w=;
output;
end;
end;
run;
CodePudding user response:
I find this a particularly good use case for call scan
, regex isn't nearly as efficient. Here I use call scan
to find the "word boundary" of the (always single) letter, then grab it plus whatever's before the next letter (or end-of-word).
data have;
length code $20 ;
input id code $;
datalines;
101 K2K5K8F10F26F2
102 L7P13P4
103 L1
;;;;
run;
data want;
set have;
do count = 1 to countw(code,,'a');
call scan(code,count,pos,len,,'a');
w = substr(code,pos-1,len 1);
output;
end;
run;
I think this would work in DS2 just as well as the data step if that's a requirement.