Extract first string of each word in pandas column

Time:10-20

I have the DataFrame below:

col1
GRE MET HOCK 38 
ASS COM CORD EMERG  INIT

I would like to create a column with the first letter of each word from col1, while keeping integers whole, as shown below:

col1                        col2
GRE MET HOCK 38             GMH38
ASS COM CORD EMERG  INIT    ACCEI 

I found something that might work, but it does not give the expected output:

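import re
input = "GRE MET HOCK 38"
# takes the first character of every word, so 38 is cut down to 3
output = "".join(item[0].upper() for item in re.findall(r"\w+", input))
# output is "GMH3", not the expected "GMH38"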

CodePudding user response:

Split the strings on whitespace, then stack the pieces into one long Series. Take the first letter of each piece, except where the piece is numeric, then join the results per original row and assign back; the assignment aligns on the original DataFrame index.

import pandas as pd
df = pd.DataFrame({'col1': ['GRE MET HOCK 38', 'ASS COM CORD EMERG  INIT']})

s = df['col1'].str.split(r'\s+', expand=True).stack()
df['col2'] = s.str[0].mask(s.str.isnumeric(), s).groupby(level=0).agg(''.join)

                       col1   col2
0           GRE MET HOCK 38  GMH38
1  ASS COM CORD EMERG  INIT  ACCEI
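The same split, mask and regroup idea can be sketched with Series.explode in place of expand=True plus stack. This is a variation, not the answer's exact code; the names parts and col2 are only illustrative. After exploding, each word keeps the index label of the row it came from, which is what lets groupby(level=0) put the pieces back together.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['GRE MET HOCK 38', 'ASS COM CORD EMERG  INIT']})

# One row per word; each word keeps the index label of its original row.
# str.split() with no pattern splits on runs of whitespace.
s = df['col1'].str.split().explode()

# First character of every word, except purely numeric words, which stay whole
parts = s.str[0].mask(s.str.isnumeric(), s)

# Grouping on the index reassembles one string per original row
col2 = parts.groupby(level=0).agg(''.join)
print(col2.tolist())
```

Because col2 carries the original index, assigning it with df['col2'] = col2 lines up row by row.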

CodePudding user response:

You can use Series.str.replace:

import pandas as pd
df = pd.DataFrame({'col':['GRE MET HOCK 38', 'ASS COM CORD EMERG  INIT']})
df['col'].str.replace(r'\b(?!\d+\b)(\w)\w*|\s+', lambda x: x.group(1).upper() if x.group(1) else '', regex=True)
# => 0    GMH38
#    1    ACCEI
#    Name: col, dtype: object

See the regex demo. Depending on what kind of numbers and what kind of word boundaries you need to support, the regex can be adjusted.

The current pattern matches

  • \b(?!\d+\b)(\w)\w* - a word boundary, then one word char (captured into Group 1) followed by zero or more word chars, provided these word chars do not constitute a digit sequence as a whole
  • | - or
  • \s+ - one or more whitespaces.

If Group 1 matches, this uppercased value is the replacement, else, the match is removed (the replacement is an empty string).
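To check the two branches of the pattern outside pandas, here is a minimal sketch with the standard re module. The helper name abbreviate and the third sample string are made up for illustration; the pattern itself is the one described above.

```python
import re

# Same pattern as above, compiled once for reuse
PATTERN = re.compile(r'\b(?!\d+\b)(\w)\w*|\s+')

def abbreviate(text):
    # Group 1 is set only by the first alternative (a word that is not all digits);
    # for a whitespace match it is None, so the match is replaced with ''
    return PATTERN.sub(lambda m: m.group(1).upper() if m.group(1) else '', text)

print(abbreviate('GRE MET HOCK 38'))           # GMH38
print(abbreviate('ASS COM CORD EMERG  INIT'))  # ACCEI
print(abbreviate('room 12b north'))            # R1N
```

The last call shows the effect of the lookahead: 12b is not a digit sequence as a whole, so it is shortened to its first character like any other word, while 38 is left untouched.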

CodePudding user response:

You can iterate over the column items, split each item on spaces to extract its words, build a new string from the first letter of each word, and save it to a list. Then add this list to the DataFrame as a new column.

from pandas import DataFrame

data = {
    'col1' : ['GRE MET HOCK 38', 'ASS COM CORD EMERG INIT'],
}

new_column = []

df = DataFrame(data)

for item in df['col1']:
    new_item = ""
    #extract words from item with split by space
    words = item.split()
    for word in words:
        #add first letter to new item, but keep purely numeric words (e.g. 38) whole
        new_item += word if word.isdigit() else word[0]
    #add new item to new column
    new_column.append(new_item)


#add new column to DataFrame
df['col2'] = new_column
print(df)
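The loop can also be written more compactly as a small helper applied with Series.map. This is only a sketch: the name initials is hypothetical, and it assumes every value in col1 is a string. It keeps purely numeric words whole, as the question asks.

```python
import pandas as pd

def initials(text):
    # str.split() with no argument splits on runs of whitespace,
    # so the double space in the second row is not a problem
    return ''.join(w if w.isdigit() else w[0] for w in text.split())

df = pd.DataFrame({'col1': ['GRE MET HOCK 38', 'ASS COM CORD EMERG  INIT']})
df['col2'] = df['col1'].map(initials)
print(df['col2'].tolist())
```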