I'm using the Titanic dataset to learn how to clean data. What I'm trying to do now is to create a new column with titles from the Name column (Mr., Mrs., Dr., etc.) and group them based on social class:
Dr. – Doctor
Rev. – Clerical
Master. – Scholar
Major., Col., Capt. – Military
Mr., Mrs., Ms., Miss. – Commoners
Don., Sir., Mme., Mlle., Lady., Countess., Jonkheer. – Nobility
I've created a function, which is for the most part working, except that all of the values return 'Other
'.
import os
import re
filename = os.path.join(os.path.dirname(__file__),'train.csv')
data = pd.read_csv(filename)
def create_title_column(data, colname):
def search_title(x):
title_search = re.search('\s([A-Za-z] )\.', x)
if title_search:
title = title_search.groups(0)
if title == 'Rev.':
return 'Clerical'
elif title == 'Master.':
return 'Scholar'
elif title == 'Dr.':
return 'Doctor'
elif title in ['Major.', 'Col.', 'Capt.']:
return 'Military'
elif title in ['Mr.','Mrs.','Ms.','Miss.']:
return 'Commoner'
elif title in ['Don.','Sir.','Mme.','Mlle.','Lady.','Countess.','Jonkheer.']:
return 'Nobility'
else:
return 'Other'
return_titles = data[colname].apply(search_title)
dict_title = {1: 'Clerical', 2:'Scholar', 3:'Doctor', 4:'Military',5:'Commoner',6:'Nobility'}
return return_titles.replace(dict_title)
data['Title'] = create_title_column(data, 'Name')
print(data.head(10))
I don't know what I'm doing wrong. Also, if you know of a better, more concise way to do this, you're welcome to show me. Any help would be great, so thank you in advance for your time.
You can download the Titanic dataset here:
CodePudding user response:
There is a perfectly good function for tasks like yours offered from the numpy
package, select
. It is a vectorized solution and it helps keep your code clean. So I wouldn't encourage defining a function for your task.
I would first start by extracting the titles from the Name column (i.e. 'Mr','Miss','Clerical' .. ).
It is important to chain str.stip()
at the end to eliminate unwanted whitespace.
# Grab the titles from the Name column
df['title']=df.Name.str.split(',').str[1].str.split('.').str[0].str.strip()
>>> df['title'].value_counts()
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Mlle 2
Major 2
Col 2
Ms 1
Don 1
Capt 1
Sir 1
Lady 1
the Countess 1
Jonkheer 1
Mme 1
Name: title, dtype: int64
In the next chunk of code, I specify what the current title is, and what each title should be changed to. Please keep in mind that the order matters (i.e. Rev will be Clerical, Master will be Scholar etc)
# Import numpy package
import numpy as np
# Set the current titles that you have in the column
current_titles = [
df['title'] == 'Rev',\
df['title'] == 'Master',\
df['title'] == 'Dr',\
df['title'].isin(['Major', 'Col', 'Capt']),\
df['title'].isin(['Mr','Mrs','Ms','Miss']),\
df['title'].isin(['Don','Sir','Mme','Mlle','Lady','Countess','Jonkheer'])\
]
# Set the new title to each title.
title_changes = ['Clerical','Scholar','Doctor','Military','Commoner','Nobility']
Having specified your conditions
(current_titles) and your choices
(title_changes), which are 2 of the 3 parameters that np.select
takes, you can create your new column with the updated data:
df["title_refined"] = np.select(current_titles, title_changes, default="Other")
Printing the 3 columns, to clearly illustrate the differences, yields your expected output:
>>> df[['Name','title','title_refined']]
Name title title_refined
0 Braund, Mr. Owen Harris Mr Commoner
1 Cumings, Mrs. John Bradley (Florence Briggs Th... Mrs Commoner
2 Heikkinen, Miss. Laina Miss Commoner
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) Mrs Commoner
4 Allen, Mr. William Henry Mr Commoner
.. ... ... ...
886 Montvila, Rev. Juozas Rev Clerical
887 Graham, Miss. Margaret Edith Miss Commoner
888 Johnston, Miss. Catherine Helen "Carrie" Miss Commoner
889 Behr, Mr. Karl Howell Mr Commoner
890 Dooley, Mr. Patrick Mr Commoner