Recoding a string variable into categories in Python similar to how you can do it in SAS

Time:01-06

I have a very long list of insurance type strings that I am categorizing. A few are standard, but there is a long tail that I would like to categorize based on whether the insurance type string contains certain terms.

For example, if the string contains "HMO" or "PPO", the category will be "private" (for "private insurance"). If the string contains "Medicaid" or "Medicare", it will be "public", and if it contains "No Insurance" or "Charity" it will be categorized as "self pay".

Some categories are already populated, as below, and other categories are null.

import pandas as pd

data = {'category': ['public', 'private', 'self pay', '', '', ''],
        'raw': ['Medicaid', 'Blue Cross Blue Shield', 'No Insurance', 'blah blah PPO or something', 'HMO with a lot of weird stuff', 'blah blah charity']
        }

df = pd.DataFrame(data)

print(df)

In SAS, here is an example of how to do this. The findw function searches "raw" for the quoted term and returns the starting position of that term. If the position is greater than 0, the term exists in raw. If the position is 0, the term doesn't exist in raw.

(Note, I don't care about the position of the text string, just that it exists. This is just a way to do it in SAS.)

IF missing(category) THEN DO;
    IF raw = 'Medicaid' THEN category = 'public';
    ELSE IF raw = 'Blue Cross Blue Shield' THEN category = 'private';
    ELSE IF findw(raw, 'Sliding Scale') > 0 THEN category = 'self pay';
    ELSE IF findw(raw, 'PPO') > 0 THEN category = 'private';
    ELSE IF findw(raw, 'HMO') > 0 THEN category = 'private';
END;

How can I create categories in a new variable from the raw variable dependent on the presence of terms (with str.contains ?) in Python similar to the SAS code above?

OR is there an even better, pythonic way?

Many thanks.

CodePudding user response:

You can define a function that maps each raw value to a category, apply it to the 'raw' column, and assign the result only where category is blank.

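def remap_categories(raw):
    # Exact matches on the standard insurance types first
    if raw == 'Medicaid':
        return 'public'
    elif raw == 'Blue Cross Blue Shield':
        return 'private'
    # Case-sensitive substring checks for the long tail
    elif 'Sliding Scale' in raw:
        return 'self pay'
    elif 'PPO' in raw:
        return 'private'
    elif 'HMO' in raw:
        return 'private'
    # No rule matched: falls through and returns None

# Fill in only the rows whose category is still blank
df.loc[df['category'] == '', 'category'] = df['raw'].apply(remap_categories)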

Output:

   category                            raw
0    public                       Medicaid
1   private         Blue Cross Blue Shield
2  self pay                   No Insurance
3   private     blah blah PPO or something
4   private  HMO with a lot of weird stuff
5      None              blah blah charity
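
Row 5 comes back as None because the function has no rule for "charity", and the `in` checks are case-sensitive substring tests rather than word searches like findw. If you want something closer to the SAS behaviour, one option is a rule table applied with str.contains. This is only a sketch: the `rules` dict, its regex patterns, and the choice to match case-insensitively on word boundaries are my assumptions, not something taken from your data.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['public', 'private', 'self pay', '', '', ''],
    'raw': ['Medicaid', 'Blue Cross Blue Shield', 'No Insurance',
            'blah blah PPO or something', 'HMO with a lot of weird stuff',
            'blah blah charity'],
})

# Hypothetical rule table: regex pattern -> category, checked in order.
# \b word boundaries roughly mimic findw (so "PPO" does not match "HIPPO");
# non-capturing groups (?:...) avoid pandas' warning about match groups.
rules = {
    r'\b(?:medicaid|medicare)\b': 'public',
    r'\b(?:hmo|ppo|blue cross blue shield)\b': 'private',
    r'\b(?:no insurance|charity|sliding scale)\b': 'self pay',
}

# Only rows with a blank category are candidates for recoding.
blank = df['category'] == ''

for pattern, label in rules.items():
    # case=False ignores case; na=False treats missing raw values as no match.
    hit = blank & df['raw'].str.contains(pattern, case=False, na=False)
    df.loc[hit, 'category'] = label
    # First matching rule wins, like the ELSE IF chain in SAS.
    blank = blank & ~hit

print(df)
```

With the sample data, rows 3 and 4 become 'private' and row 5 becomes 'self pay'. Rows that match no rule keep their blank category instead of turning into None.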