Home > Back-end >  RegEx Results in new Column do not work properly (Pandas)
RegEx Results in new Column do not work properly (Pandas)

Time:10-03

I'm using the Titanic dataset to learn how to clean data. What I'm trying to do now is to create a new column with titles from the Name column (Mr., Mrs., Dr., etc.) and group them based on social class:

Dr. – Doctor
Rev. – Clerical
Master. – Scholar 
Major., Col., Capt. – Military 
Mr., Mrs., Ms., Miss. – Commoners
Don., Sir., Mme., Mlle., Lady., Countess., Jonkheer. – Nobility 

I've created a function, which is for the most part working, except that all of the values return 'Other'.

import os
import re

filename = os.path.join(os.path.dirname(__file__),'train.csv')
data = pd.read_csv(filename)

def create_title_column(data, colname):
    def search_title(x):
        title_search = re.search('\s([A-Za-z] )\.', x)
        if title_search:
            title = title_search.groups(0)
            if title == 'Rev.':
                return 'Clerical'
            elif title == 'Master.':
                return 'Scholar'
            elif title == 'Dr.':
                return 'Doctor'
            elif title in ['Major.', 'Col.', 'Capt.']:
                return 'Military'
            elif title in ['Mr.','Mrs.','Ms.','Miss.']:
                return 'Commoner'
            elif title in ['Don.','Sir.','Mme.','Mlle.','Lady.','Countess.','Jonkheer.']:
                return 'Nobility'
            else:
                return 'Other'

    return_titles = data[colname].apply(search_title)
    dict_title = {1: 'Clerical', 2:'Scholar', 3:'Doctor', 4:'Military',5:'Commoner',6:'Nobility'}
    return return_titles.replace(dict_title)
data['Title'] = create_title_column(data, 'Name')

print(data.head(10))

I don't know what I'm doing wrong. Also, if you know of a better, more concise way to do this, you're welcome to show me. Any help would be great, so thank you in advance for your time.

You can download the Titanic dataset here: enter image description here

CodePudding user response:

There is a perfectly good function for tasks like yours offered from the numpy package, select. It is a vectorized solution and it helps keep your code clean. So I wouldn't encourage defining a function for your task.

I would first start by extracting the titles from the Name column (i.e. 'Mr','Miss','Clerical' .. ).

It is important to chain str.stip() at the end to eliminate unwanted whitespace.

# Grab the titles from the Name column
df['title']=df.Name.str.split(',').str[1].str.split('.').str[0].str.strip()
>>> df['title'].value_counts()

 Mr              517
 Miss            182
 Mrs             125
 Master           40
 Dr                7
 Rev               6
 Mlle              2
 Major             2
 Col               2
 Ms                1
 Don               1
 Capt              1
 Sir               1
 Lady              1
 the Countess      1
 Jonkheer          1
 Mme               1
Name: title, dtype: int64

In the next chunk of code, I specify what the current title is, and what each title should be changed to. Please keep in mind that the order matters (i.e. Rev will be Clerical, Master will be Scholar etc)

# Import numpy package
import numpy as np

# Set the current titles that you have in the column
current_titles  = [
              df['title'] == 'Rev',\
              df['title'] == 'Master',\
              df['title'] == 'Dr',\
              df['title'].isin(['Major', 'Col', 'Capt']),\
              df['title'].isin(['Mr','Mrs','Ms','Miss']),\
              df['title'].isin(['Don','Sir','Mme','Mlle','Lady','Countess','Jonkheer'])\
              ]
# Set the new title to each title.
title_changes     = ['Clerical','Scholar','Doctor','Military','Commoner','Nobility']

Having specified your conditions (current_titles) and your choices (title_changes), which are 2 of the 3 parameters that np.select takes, you can create your new column with the updated data:

df["title_refined"] = np.select(current_titles, title_changes, default="Other")

Printing the 3 columns, to clearly illustrate the differences, yields your expected output:

>>> df[['Name','title','title_refined']]

                                                  Name title title_refined
0                              Braund, Mr. Owen Harris    Mr      Commoner
1    Cumings, Mrs. John Bradley (Florence Briggs Th...   Mrs      Commoner
2                               Heikkinen, Miss. Laina  Miss      Commoner
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)   Mrs      Commoner
4                             Allen, Mr. William Henry    Mr      Commoner
..                                                 ...   ...           ...
886                              Montvila, Rev. Juozas   Rev      Clerical
887                       Graham, Miss. Margaret Edith  Miss      Commoner
888           Johnston, Miss. Catherine Helen "Carrie"  Miss      Commoner
889                              Behr, Mr. Karl Howell    Mr      Commoner
890                                Dooley, Mr. Patrick    Mr      Commoner
  • Related