Home > Blockchain >  How to get the first letter of each word in a string of dataframe column
How to get the first letter of each word in a string of dataframe column

Time:09-20

I have a dataframe column of first and last names. I want to extract the initials from the names as another column in my dataframe. For the following dataframe:

   Name
0 'Brad Pitt'
1 'Bill Gates'
2 'Elon Musk'

I have came up with a solution:

df['initials'] = [df['Name'][i].split()[0][0]   df['Name'][i].split()[1][0] for i in range(len(df))]

However, for a name such as 'John David Smith', this does not work, as I want to have the first letter of each word in a name. Moreover, since my dataframe is quite large, I would like to know if there is a 'vectorized' solution (without for loops).

Thank you in advance.

CodePudding user response:

Use list comprehension if performance is important with split and join:

df['initials'] = [' '.join(y[0] for y in x.split()) for x in df['Name']]
print (df)
         Name initials
0   Brad Pitt      B P
1  Bill Gates      B G
2   Elon Musk      E M

Or:

df['initials'] = df['Name'].apply(lambda x: ' '.join(y[0] for y in x.split()))

Solution with no for, but is is really slow:

df['initials'] = df['Name'].str.split(expand=True).apply(lambda x: x.str[0]).fillna('').agg(' '.join, axis=1).str.rstrip()

Performmance for 400k rows:

print (df)
               Name
0         Brad Pitt
1        Bill Gates
2         Elon Musk
3  John David Smith

df = pd.concat([df] * 100000, ignore_index=True)

Fastest is second and first solution, then first @mozway answer, slowiest is second @mozway solution:

In [178]: %%timeit
     ...: df['initials2'] = df['Name'].apply(lambda x: ' '.join(y[0] for y in x.split()))
     ...: 
442 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [177]: %%timeit
     ...: df['initials1'] = [' '.join(y[0] for y in x.split()) for x in df['Name']]
     ...: 
     ...: 
485 ms ± 7.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [180]: %%timeit
     ...: df['initials'] = df['Name'].str.replace(r'(?<=\w)\w', '', regex=True)
     ...: 
830 ms ± 8.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [179]: %%timeit 
     ...: df['initials3'] = df['Name'].str.split(expand=True).apply(lambda x: x.str[0]).fillna('').agg(' '.join, axis=1).str.rstrip()
     ...: 
18.8 s ± 772 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [181]: %%timeit
     ...: df['initials'] = (df['Name'].str.extractall(r'(?<!\w)(\w)').groupby(level=0).agg(' '.join))                 
     ...: 
     ...: 
25.3 s ± 692 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response:

Quite easy using a short regex:

df['initials'] = df['Name'].str.replace(r'(?<=\w)\w ', '', regex=True)

Alternative:

df['initials'] = (df['Name'].str.extractall(r'(?<!\w)(\w)')
                  .groupby(level=0).agg(' '.join)
                 )

output:

               Name initials
0         Brad Pitt      B P
1        Bill Gates      B G
2         Elon Musk      E M
3  John David Smith    J D S

how is works

solution 1:

  • find each set of letter(s) that is not the first of a word
  • delete it

regex demo

solution 2:

  • find all first letters
  • join them with a space

regex demo

  • Related