In my dataset, the last name (lname) occasionally has the generational suffix attached. Regarding the generational suffix:
- there are no spaces or other possible delimiters between the lname variable and the suffix
- the suffix ranges between 2 and 4 characters in length
- the suffix is a mix of lowercase, uppercase, and proper case
- the suffix sometimes includes a combination of integers and characters
I tried to think simple solutions first. I couldn't think of any using Excel because all of their string solutions require having a consistent position of the values to be removed.
In SAS, PARSE requires a delimiter, and TRIM requires a consistent position.
In the syntax I've attached are four different approaches I tried. None of them were successful, and I totally admit user error. I'm not familiar with any of them other than COMPRESS, and then only for removing blanks.
Is there a way I can make a new variable for last name that doesn't have the generational suffix attached?
Thank you so much!
This first piece applies to each of the my attempts.
data want;
input id lname $ fname $;
datalines;
123456 Smith John
234567 SMITH ANDREW
345678 SmithJr Alan
456789 SMITHSR SAM
789012 smithiii robert
890123 smithIIII william
901234 Smith4th Tim
;
run;
My attempts start here.
/* COMPRESS */
data want;
set have;
lname2 = compress(lname,'Jr');
put string=;
run;
/* TRANWARD */
data want;
set have;
lname2 = tranwrd(lname,"Jr", "");
lname2 = tranwrd(lname,"Sr", "");
lname2 = tranwrd(lname,"III", "");
run;
/* PRXCHANGE */
data want;
set have;
lname2 = lname;
lname2 = prxchange('s/(.*)(jr|sr|iii|iv)$/$1/i',1,trim(lname));
run;
/* PRXMATCH */
data want;
set have;
if prxmatch('/Jr|Sr|III/',lname) then lname2 = '';
run;
CodePudding user response:
- You can not use compress() for this purpose at all
- Instead of tranwrd (it requires a delimiter) you might try to use translate. But you will not solve the problem of replacing your pattern in the beginning or midle of the word
- The example of prxmatch is below.
data have;
input id lname $ fname $;
datalines;
123456 Smith John
234567 SMITH ANDREW
345678 SmithJr Alan
456789 SMITHSR SAM
789012 smithiii robert
890123 smithIIII william
901234 Smith4th Tim
901235 SRith4th Tim
;
run;
data want;
set have;
/* Use PRXPARSE to compile the Perl regular expression. */
patternID=prxparse('/(JR$)|(SR$)|(III$)/');
/* Use PRXMATCH to find the position of the pattern match. */
position=prxmatch(patternID, compress(upcase(lname)));
put position=;
if position then do;
put lname=;
lname2 = '';
end;
run;
CodePudding user response:
I think you're fine with your prxchange method, for me it's the most reliable and easy to maintain, I would just change 2 things:
- us the 'o' modifier to compile only once the regex
- Use a strip instead of a trim (strip is an equivalent of ltrim rtrim)
data want;
set have;
attrib lname2 format=$50.;
lname2 = prxchange('s/(.*)(jr|sr|iii|iv)$/$1/oi', 1, strip(lname));
run;