How do I remove the generational suffix from last name (if there is one) using SAS?-CodePudding

In my dataset, the last name (lname) occasionally has the generational suffix attached. Regarding the generational suffix:

there are no spaces or other possible delimiters between the lname variable and the suffix
the suffix ranges between 2 and 4 characters in length
the suffix is a mix of lowercase, uppercase, and proper case
the suffix sometimes includes a combination of integers and characters

I tried to think simple solutions first. I couldn't think of any using Excel because all of their string solutions require having a consistent position of the values to be removed.

In SAS, PARSE requires a delimiter, and TRIM requires a consistent position.

In the syntax I've attached are four different approaches I tried. None of them were successful, and I totally admit user error. I'm not familiar with any of them other than COMPRESS, and then only for removing blanks.

Is there a way I can make a new variable for last name that doesn't have the generational suffix attached?

Thank you so much!

This first piece applies to each of the my attempts.

data want;
    input id lname $ fname $;
    datalines;
        123456  Smith       John
        234567  SMITH       ANDREW
        345678  SmithJr     Alan
        456789  SMITHSR     SAM
        789012  smithiii    robert
        890123  smithIIII   william
        901234  Smith4th    Tim
        ;
run;

My attempts start here.

/* COMPRESS */
data want;
    set have;
    lname2 = compress(lname,'Jr');
    put string=;
run;

/* TRANWARD */
data want;
    set have;
    lname2 = tranwrd(lname,"Jr", "");
    lname2 = tranwrd(lname,"Sr", "");
    lname2 = tranwrd(lname,"III", "");
run;

/* PRXCHANGE */
data want;
    set have;
    lname2 = lname;
    lname2 = prxchange('s/(.*)(jr|sr|iii|iv)$/$1/i',1,trim(lname));
run;

/* PRXMATCH */
data want;
    set have;
    if prxmatch('/Jr|Sr|III/',lname) then lname2 = '';
run;

CodePudding user response：

You can not use compress() for this purpose at all
Instead of tranwrd (it requires a delimiter) you might try to use translate. But you will not solve the problem of replacing your pattern in the beginning or midle of the word
The example of prxmatch is below.

    data have;
        input id lname $ fname $;
        datalines;
            123456  Smith       John
            234567  SMITH       ANDREW
            345678  SmithJr     Alan
            456789  SMITHSR     SAM
            789012  smithiii    robert
            890123  smithIIII   william
            901234  Smith4th    Tim
            901235  SRith4th    Tim
            ;
    run;
         
    data want;
        set have;
    
          /* Use PRXPARSE to compile the Perl regular expression.    */
       patternID=prxparse('/(JR$)|(SR$)|(III$)/');
          /* Use PRXMATCH to find the position of the pattern match. */
       position=prxmatch(patternID, compress(upcase(lname)));
       put position=;
       if position then do;
         put lname=;
         lname2 = '';
       end;
    run;

CodePudding user response：

I think you're fine with your prxchange method, for me it's the most reliable and easy to maintain, I would just change 2 things:

us the 'o' modifier to compile only once the regex
Use a strip instead of a trim (strip is an equivalent of ltrim rtrim)

data want;
    set have;
    attrib lname2 format=$50.;
    lname2 = prxchange('s/(.*)(jr|sr|iii|iv)$/$1/oi', 1, strip(lname));
run;