How to replace text in quotes with the same equal length asterisks in SAS? I mean, convert:
"12345"
"hi42"
'with "double" quotes'
there are 'other words' not in quotes
to:
*******
******
**********************
there are ************* not in quotes
There are 7,6,22,13 asterisks in line 1,2,3,4 separately. Yes, quotes themself are included, too.
I tried program like this:
pat=prxparse('/[''"].*?["'']/');
do until(pos=0);
call prxsubstr(pat,text,pos,len);
if pos then substr(text,pos,len)=repeat('*',len-1);
end;
It works.
My question is: Is there a more efficient way to do this?
CodePudding user response:
I would skip regex and just use CALL SCAN() instead.
So loop through finding the location of the next "word". If the word begins and ends with a quote then replace the word with *'s.
data have;
input string $char80. ;
cards;
"12345"
"hi42"
'with "double" quotes'
there are 'other words' not in quotes
What's going on?
;
data want;
set have;
position=1;
do count=1 by 1 while(position>0);
call scan(string,count,position,length,' ','q');
if char(string,position) in ('"',"'")
and char(string,position)=char(string,position length-1)
then substr(string,position,length) = repeat('*',length-1)
;
end;
drop position count length;
run;
Result
Obs string
1 *******
2 ******
3 **********************
4 there are ************* not in quotes
5
6 What's going on?
CodePudding user response:
First off, your example fails on the third expression, because it doesn't remember what the opening quote was - so it leaves "double" unmatched.
You can solve that with a backreference, which is supported by SAS:
data have;
length text $1024;
infile datalines pad;
input @1 text $80.;
datalines;
"12345"
"hi42"
'with "double" quotes'
there are 'other words' not in quotes
;;;;
run;
data want;
set have;
pat=prxparse('/([''"]).*?\1/');
do until(pos=0);
call prxsubstr(pat,text,pos,len);
if pos then substr(text,pos,len)=repeat('*',len-1);
end;
run;
Efficiency wise, this takes about 1.5 seconds on my (fairly fast but not exceptionally so) SAS server to handle 400k records (these 4 x 100,000). This seems reasonable, unless your text is much bigger or your row count much larger. Also, note this will fail on highly complicated nesting if that's permissible (single-double-single etc., or double-single inside single won't be recognized, though it probably will still work fine for your intentions).
However, if you want most efficient, regular expressions are not the answer - it is more efficient to use basic text functions. It's harder to get it exactly right though, and takes a lot more code, so I don't suggest doing this if the regex is acceptable performance. But here's one example - you may need to tweak it some, and you'll need to loop it to repeat until it doesn't find any to replace, and not execute it if there are no quotes at all. This just gives the basic idea of how to use the text functions.
data want;
set have;
length text_sub $80;
_start = findc(text,'"''');
_qchar = char(text,_start); *Save aside which char we matched on;
_end = findc(text,_qchar,_start 1); *now look for that one again anywhere after the first match;
to_convert = substr(text,_start,_end-_start 1);
if _start eq 1 and _end eq length(text) then text_sub = repeat('*',_end-1);
else if _start eq 1 then text_sub = substr(text,_end 1);
else if _end eq length(text) then text_sub = substr(text,1,_start-1)||repeat('*',_end-_start);
else text_sub = cat(substr(text,1,_start-1),repeat('*',_end-_start),substr(text,_end 1));
run;