I have a dataframe with a column holding last name, first name and second name:
name
Johnson John William
Peterson Andrew James
Burnham Edward Alexander
....
I want to create a new column "initials" that keeps the last name and appends, separated by underscores, the first letters of the first and second names:
name                      initials
Johnson John William      Johnson_J_W
Peterson Andrew James     Peterson_A_J
Burnham Edward Alexander  Burnham_E_A
....
How can I do that concisely? My idea is to use split() to create three columns, extract the first letters from two of them, and then join all three again with underscores, but that seems inefficient.
CodePudding user response:
Assuming pandas, you can use a simple regex; this way you benefit from a vectorized (i.e. fast) string operation:
df['initials'] = df['name'].str.replace(r'\s+([A-Z])[a-z]+', r'_\1', regex=True)
If the case doesn't matter:
df['initials'] = df['name'].str.replace(r'\s+(\w)\w+', r'_\1', regex=True)
output:
                       name      initials
0      Johnson John William   Johnson_J_W
1     Peterson Andrew James  Peterson_A_J
2  Burnham Edward Alexander   Burnham_E_A
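To see what the uppercase-initial pattern `\s+([A-Z])[a-z]+` does to a single string, here is a minimal sketch using Python's standard `re` module instead of pandas. The helper name `initials` is made up for illustration; pandas' `str.replace` with `regex=True` accepts the same regex syntax.

```python
import re

# Whitespace, then one captured capital letter, then the lowercase rest of
# the word. Only the captured letter survives the substitution.
PATTERN = re.compile(r"\s+([A-Z])[a-z]+")

def initials(name):
    # Hypothetical helper: each match such as " John" becomes "_J".
    # The last name has no whitespace in front of it, so it is never matched.
    return PATTERN.sub(r"_\1", name)

print(initials("Johnson John William"))  # Johnson_J_W
```

Because the pattern needs leading whitespace, the first word is always kept whole, which is why no special handling of the last name is required.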
CodePudding user response:
I would use pandas' apply method, passing a function ('to_initials') that processes each entry in the 'name' column of the dataframe.
import pandas as pd

def to_initials(x):
    # Expects exactly three space-separated parts: last, first, second.
    last, first, second = x.split(" ")
    return last + "_" + first[0] + "_" + second[0]

df = pd.DataFrame({"name": ["Johnson John William", "Peterson Andrew James", "Burnham Edward Alexander"]})
df["initials"] = df["name"].apply(to_initials)
Alternatively, you can use a Python lambda function and do it in one line:
df["initials"] = df["name"].apply(lambda x: x.split(" ")[0] + "_" + x.split(" ")[1][0] + "_" + x.split(" ")[2][0])
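Both versions assume three space-separated parts per entry. If there are entries with fewer names (for example only one), they will raise an error, so you would have to extend the function.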
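One possible extension, as a rough sketch: keep the first word whole and abbreviate however many names follow it. The function name `to_initials_any` is hypothetical, and returning an empty string for empty input is an assumption about what you want in that case.

```python
def to_initials_any(name):
    # split() with no argument splits on any run of whitespace and
    # ignores leading/trailing spaces.
    parts = name.split()
    if not parts:
        # Assumption: an empty or blank entry maps to an empty string.
        return ""
    last, *given = parts
    # Keep the last name whole, take the first letter of every other name.
    return "_".join([last] + [g[0] for g in given])

print(to_initials_any("Johnson John William"))  # Johnson_J_W
print(to_initials_any("Johnson"))               # Johnson
```

It can be applied the same way as before, e.g. df["initials"] = df["name"].apply(to_initials_any).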