Find the most popular male/female name in a DataFrame - CodePudding

The solution that came to mind is:

dataset['Name'].loc[dataset['Sex'] == 'female'].value_counts().idxmax()

But it is not that simple: for married women the husband's name follows "Mrs." and the woman's own name is in parentheses, so I need to split the name somehow.

Input data:

df = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'Heikkinen, Miss. Laina', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'Allen, Mr. William Henry', 'Moran, Mr. James', 'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard', 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)', 'Nasser, Mrs. Nicholas (Adele Achem)'],
                   'Sex': ['male', 'female', 'female', 'female', 'male', 'male', 'male', 'male', 'female', 'female'],
                   })



Task 4: Name the most popular female name on the ship.
'some code'
Output: Anna      #The most popular female name
Task 5: Name the most popular male name on the ship.
'some code'
Output: Wilhelm   #The most popular male name

CodePudding user response：

Quick and dirty would be something like:

from collections import Counter

# Random list of names
your_lst = ["Mrs Braun", "Allen, Mr. Timothy J", "Allen, Mr. Henry William"]

# Split names by space, and flatten the list.
your_lst_flat = [item for x in your_lst for item in x.split(" ")]

# Count occurrences. With this you will get a count of all the values, including Mr and Mrs. But you can just ignore these.
Counter(your_lst_flat).most_common()
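
If you would rather not ignore the titles by eye, here is a minimal sketch of the same counting idea that filters them out. The `TITLES` stop list and the `count_tokens` helper are hypothetical names made up for this example, and it assumes the "Surname, Title. Given names" layout from the question. Note that it still counts middle names and, for "Mrs." rows, the husband's names.

```python
from collections import Counter

# A few names copied from the question's input data.
names = [
    "Braund, Mr. Owen Harris",
    "Allen, Mr. William Henry",
    "Moran, Mr. James",
    "McCarthy, Mr. Timothy J",
]

# Hypothetical stop list: titles that should not be counted as names.
TITLES = {"Mr", "Mrs", "Miss", "Master"}


def count_tokens(full_names):
    """Count every word after the surname, skipping titles."""
    counts = Counter()
    for full_name in full_names:
        # Keep only the part after "Surname, " (the whole string if there is no comma).
        rest = full_name.split(", ", 1)[-1]
        for token in rest.split():
            # Remove trailing periods and surrounding parentheses.
            token = token.strip(".()")
            if token and token not in TITLES:
                counts[token] += 1
    return counts


token_counts = count_tokens(names)
print(token_counts.most_common(3))
```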

CodePudding user response：

IIUC, you can use a regex to extract either the first name after the title or, for "Mrs.", the name right after the opening parenthesis:

s = df['Name'].str.extract(r'((?:(?<=Mr. )|(?<=Miss. )|(?<=Master. ))\w+|(?<=\()\w+)',
                           expand=False)
s.groupby(df['Sex']).value_counts()

output:

Sex     Name     
female  Adele        1
        Elisabeth    1
        Florence     1
        Laina        1
        Lily         1
male    Gosta        1
        James        1
        Owen         1
        Timothy      1
        William      1
Name: Name, dtype: int64
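
To see which branch of the pattern fires for which kind of row, here is a small sketch using only the standard `re` module. It assumes the pattern is written with `\w+` quantifiers, and it escapes the dots so that they only match a literal period (the escaping is my assumption, stricter than the pattern in the answer).

```python
import re

# Two alternatives: a word right after "Mr. ", "Miss. " or "Master. ",
# or a word right after an opening parenthesis.
pattern = re.compile(r'((?:(?<=Mr\. )|(?<=Miss\. )|(?<=Master\. ))\w+|(?<=\()\w+)')

samples = [
    'Braund, Mr. Owen Harris',                              # title branch
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',  # parenthesis branch
    'Heikkinen, Miss. Laina',                               # title branch
]

# search() returns the leftmost match; group(1) is the captured name.
extracted = [pattern.search(name).group(1) for name in samples]
print(extracted)
```

"Mrs. " is not matched by any of the lookbehinds, so for those rows the regex skips the husband's name and only matches at the parenthesis.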


Once you have s, to get the most frequent female name(s):

s[df['Sex'].eq('female')].mode()
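
Putting it together, here is a minimal end-to-end sketch that answers both tasks on the question's sample data. `most_common_first_names` and `PATTERN` are hypothetical names chosen for this example, and the pattern is written with `\w+` and escaped dots, which is my assumption about the intended regex.

```python
import pandas as pd

# Input data from the question.
df = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina',
             'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
             'Allen, Mr. William Henry',
             'Moran, Mr. James',
             'McCarthy, Mr. Timothy J',
             'Palsson, Master. Gosta Leonard',
             'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
             'Nasser, Mrs. Nicholas (Adele Achem)'],
    'Sex': ['male', 'female', 'female', 'female', 'male',
            'male', 'male', 'male', 'female', 'female'],
})

# First name after Mr./Miss./Master., or the first word inside parentheses.
PATTERN = r'((?:(?<=Mr\. )|(?<=Miss\. )|(?<=Master\. ))\w+|(?<=\()\w+)'


def most_common_first_names(frame, sex):
    """Return every first name tied for most frequent among rows of the given sex."""
    first = frame['Name'].str.extract(PATTERN, expand=False)
    return first[frame['Sex'].eq(sex)].mode().tolist()


print(most_common_first_names(df, 'female'))
print(most_common_first_names(df, 'male'))
```

`mode()` returns all tied values, so on this ten-row sample, where every first name occurs once, each call lists all five names for that sex. On the full dataset it should narrow down to the actual winner(s).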